%md ## Examining the Contents of a JSON File

JSON is a common file format used in big data applications and in data lakes (large stores of diverse data). File formats such as JSON arise from a number of data needs. For instance, what if:

* Your schema, or the structure of your data, changes over time?
* You need nested fields, like an array with many values or an array of arrays?
* You don't know how you're going to use your data yet, so you don't want to spend time creating relational tables?

The popularity of JSON is largely due to the fact that JSON allows for nested, flexible schemas.

This lesson uses the `DatabricksBlog` table, which is backed by the JSON file `dbfs:/mnt/training/databricks-blog.json`. If you examine the raw file, notice that it contains compact JSON data: there's a single JSON object on each line of the file, and each object corresponds to a row in the table. Each row represents a blog post on the <a href="https://databricks.com/blog" target="_blank">Databricks blog</a>, and the table contains all blog posts through August 9, 2017.
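To make "a single JSON object on each line" concrete, here is a minimal plain-Python sketch (using the standard `json` module, outside Spark) that parses one such line. The record mirrors an actual row of the table, with the long `content` field omitted:

```python
import json

# One line of the compact file: a complete JSON object per line,
# and each parsed object corresponds to one row of the table.
line = ('{"authors": ["Tathagata Das"], '
        '"categories": ["Apache Spark", "Engineering Blog", "Machine Learning"], '
        '"dates": {"createdOn": "2014-04-10", "publishedOn": "2014-04-10", "tz": "UTC"}}')

post = json.loads(line)            # each line parses independently of the others
print(post["authors"][0])          # 'Tathagata Das'  (array field)
print(post["dates"]["createdOn"])  # '2014-04-10'     (nested struct field)
```

Spark's JSON reader applies the same line-at-a-time model at scale, which is why this format parallelizes well.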
databricksBlogDF.printSchema()
%md Run a query to view the contents of the table. Notice:

* The `authors` column is an array containing one or more author names.
* The `categories` column is an array of one or more blog post category names.
* The `dates` column contains the nested fields `createdOn`, `publishedOn`, and `tz`.
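Because `authors` is an array, a single blog post can map to several author entries when flattened. As a plain-Python sketch of that idea (the record below mirrors an actual table row, with `content` omitted):

```python
import json

# A record shaped like a DatabricksBlog row with two authors.
line = ('{"authors": ["Michael Armbrust", "Reynold Xin"], '
        '"categories": ["Apache Spark", "Engineering Blog"], '
        '"dates": {"createdOn": "2014-03-27", "publishedOn": "2014-03-27", "tz": "UTC"}}')
post = json.loads(line)

# Flatten the array column: one (author, publishedOn) pair per array element.
pairs = [(author, post["dates"]["publishedOn"]) for author in post["authors"]]
print(pairs)  # [('Michael Armbrust', '2014-03-27'), ('Reynold Xin', '2014-03-27')]
```

In Spark the equivalent per-element expansion of an array column is typically done with an explode operation, but the row-to-rows intuition is the same.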
display(databricksBlogDF.select("authors","categories","dates","content"))
| authors | categories | dates | content |
| --- | --- | --- | --- |
| ["Tomer Shiran (VP of Product Management at MapR)"] | ["Company Blog","Partners"] | {"createdOn":"2014-04-10","publishedOn":"2014-04-10","tz":"UTC"} | <div class="post-meta">This post is guest authored by our friends at MapR, announcing our new partnership to provide enterprise support for Apache Spark as part of MapR's Distribution of Hadoop.</div> <hr /> With over 500 paying customers, my team and I have the opportunity to talk to many organizations that are leveraging Hadoop in production to extract value from big data. One of the most common topics raised by our customers in recent months is Apache Spark. Some customers just want to learn more about the advantages of this technology and the use cases that it addresses, while others are already running it in production with the MapR Distribution. These customers range from the world’s largest cable telcos and retailers to Silicon Valley startups such as Quantifind, which recently talked about its use of Spark on MapR in an <a href="http://www.datameer.com/ceoblog/big-data-brews-with-erich-nachbar/" target="_blank">interview</a> with Stefan Groschupf, CEO of Datameer. Today, I a... |
| ["Tathagata Das"] | ["Apache Spark","Engineering Blog","Machine Learning"] | {"createdOn":"2014-04-10","publishedOn":"2014-04-10","tz":"UTC"} | We are happy to announce the availability of <a href="http://spark.apache.org/releases/spark-release-0-9-1.html" target="_blank">Apache Spark 0.9.1</a>! This is a maintenance release with bug fixes, performance improvements, better stability with YARN and improved parity of the Scala and Python API. We recommend all 0.9.0 users to upgrade to this stable release. This is the first release since Spark graduated as a top level Apache project. Contributions to this release came from 37 developers. Visit the <a href="http://spark.apache.org/releases/spark-release-0-9-1.html" target="_blank">release notes</a> for more information about all the improvements and bug fixes. <a href="http://spark.apache.org/downloads.html" target="_blank">Download</a> it and try it out! |
| ["Steven Hillion"] | ["Company Blog","Partners"] | {"createdOn":"2014-04-01","publishedOn":"2014-04-01","tz":"UTC"} | <div class="post-meta">This post is guest authored by our friends at Alpine Data Labs, part of the 'Application Spotlight' series highlighting innovative applications that are part of the Databricks "Certified on Apache Spark" program.</div> <hr /> Everyone knows how hard it is to recruit engineers and data scientists in Silicon Valley. At <a href="http://www.alpinenow.com" target="_blank">Alpine Data Labs</a>, we think what we’re up to is pretty fun and challenging, but we still have to compete with other start-ups as well as the big internet companies to attract the best talent. One thing that can help is to be able to say that you’re working with the most innovative and powerful technologies. Last year, I was interviewing a talented engineer with a strong background in machine learning. And he said that the one thing he wanted to do above all was to work with Apache Spark. “Will I get to do that at Alpine?” he asked. If it had been even a year earlier, I would have said “Sure…at... |
| ["Michael Armbrust","Reynold Xin"] | ["Apache Spark","Engineering Blog"] | {"createdOn":"2014-03-27","publishedOn":"2014-03-27","tz":"UTC"} | Building a unified platform for big data analytics has long been the vision of Apache Spark, allowing a single program to perform ETL, MapReduce, and complex analytics. An important aspect of unification that our users have consistently requested is the ability to more easily import data stored in external sources, such as Apache Hive. Today, we are excited to announce <a href="https://spark.apache.org/docs/latest/sql-programming-guide.html">Spark SQL</a>, a new component recently merged into the Spark repository. Spark SQL brings native support for SQL to Spark and streamlines the process of querying data stored both in RDDs (Spark’s distributed datasets) and in external sources. Spark SQL conveniently blurs the lines between RDDs and relational tables. Unifying these powerful abstractions makes it easy for developers to intermix SQL commands querying external data with complex analytics, all within in a single application. Concretely, Spark SQL will allow developers to: <ul> <li>I... |
| ["Patrick Wendell"] | ["Apache Spark","Engineering Blog"] | {"createdOn":"2014-02-04","publishedOn":"2014-02-04","tz":"UTC"} | Our goal with Apache Spark is very simple: provide the best platform for computation on big data. We do this through both a powerful core engine and rich libraries for useful analytics tasks. Today, we are excited to announce the release of Apache Spark 0.9.0. This major release extends Spark’s libraries and further improves its performance and usability. Apache Spark 0.9.0 is the largest release to date, with work from 83 contributors, who submitted over 300 patches. Apache Spark 0.9 features significant extensions to the set of standard analytical libraries packaged with Spark. The release introduces GraphX, a library for graph computation that comes with implementations of several standard algorithms, such as PageRank. Spark’s machine learning library (MLlib) has been extended to support Python, using the NumPy numerical library. A Naive Bayes Classifier has also been added to MLlib. Finally, Spark Streaming, which supports near-real-time continuous computation, has added a simplif... |
| ["Ali Ghodsi","Ahir Reddy"] | ["Apache Spark","Ecosystem","Engineering Blog"] | {"createdOn":"2014-01-02","publishedOn":"2014-01-02","tz":"UTC"} | Apache Hadoop integration has always been a key goal of Apache Spark and <a href="http://hortonworks.com/wp-content/uploads/2013/06/YARN.png">YARN</a> users have long been able to run <a href="http://spark.incubator.apache.org/docs/latest/running-on-yarn.html">Spark on YARN</a>. However, up to now, it has been relatively hard to run Spark on Hadoop MapReduce v1 clusters, i.e. clusters that do not have YARN installed. Typically, users would have to get permission to install Spark/Scala on some subset of the machines, a process that could be time consuming. Enter <a href="http://databricks.github.io/simr/">SIMR (Spark In MapReduce)</a>, which has been released in conjunction with <a href="https://databricks.com/blog/2013/12/19/release-0_8_1.html">Apache Spark 0.8.1</a>. SIMR allows anyone with access to a Hadoop MapReduce v1 cluster to run Spark out of the box. A user can run Spark directly on top of Hadoop MapReduce v1 without any administrative rights, and without having Spark or Scal... |
| ["Russell Cardullo (Data Infrastructure Engineer at Sharethrough)","Michael Ruggiero (Data Infrastructure Engineer at Sharethrough)"] | ["Company Blog","Customers"] | {"createdOn":"2014-03-26","publishedOn":"2014-03-26","tz":"UTC"} | <div class="post-meta">We're very happy to see our friends at Cloudera continue to get the word out about Apache Spark, and their latest blog post is a great example of how users are putting Spark Streaming to use to solve complex problems in real time. Thanks to Russell Cardullo and Michael Ruggiero, Data Infrastructure Engineers at <a href="http://engineering.sharethrough.com/">Sharethrough</a>, for this <a href="http://blog.cloudera.com/blog/2014/03/letting-it-flow-with-spark-streaming/">guest post on Cloudera's blog</a>, which we've cross-posted below</div> <hr /> At Sharethrough, which offers an advertising exchange for delivering in-feed ads, we’ve been running on CDH for the past three years (after migrating from Amazon EMR), primarily for ETL. With the launch of our exchange platform in early 2013 and our desire to optimize content distribution in real time, our needs changed, yet CDH remains an important part of our infrastructure. In mid-2013, we began to examine stream-ba... |
| ["Jai Ranganathan","Matei Zaharia"] | ["Apache Spark","Engineering Blog"] | {"createdOn":"2014-03-21","publishedOn":"2014-03-21","tz":"UTC"} | <div class="post-meta"> This article was cross-posted in the <a href="http://blog.cloudera.com/blog/2014/03/apache-spark-a-delight-for-developers/">Cloudera developer blog</a>. </div> <a href="http://spark.apache.org/">Apache Spark</a> is well known today for its <a href="http://blog.cloudera.com/blog/2013/11/putting-spark-to-use-fast-in-memory-computing-for-your-big-data-applications/">performance benefits</a> over MapReduce, as well as its <a href="http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/">versatility</a>. However, another important benefit — the elegance of the development experience — gets less mainstream attention. In this post, you’ll learn just a few of the features in Spark that make development purely a pleasure. <h2>Language Flexibility</h2> Spark natively provides support for a variety of popular development languages. Out of the box, it supports Scala, Java, and Python, with some promising work ongoing <a href="http:/... |
| ["Databricks Press Office"] | ["Announcements","Company Blog"] | {"createdOn":"2014-03-19","publishedOn":"2014-03-19","tz":"UTC"} | <strong>BERKELEY, Calif. – March 18, 2014 –</strong> Databricks, the company founded by the creators of Apache Spark that is revolutionizing what enterprises can do with Big Data, today announced the Databricks <a href="/certification/">“Certified on Spark” Program</a> for applications built on top of the Apache Spark platform. This program ensures that certified applications will work with a multitude of commercially supported Spark distributions. “Pioneering application developers that are leveraging the power of Spark have had to choose between two sub-optimal choices: they either have to package Spark platform support with their application or attempt to maintain integration/certification individually with a rapidly increasing set of commercially supported Spark distributions,” said Ion Stoica, Databricks CEO. “The Databricks ‘Certified on Spark’ program enables developers to certify solely against the 100% open-source Apache Spark distribution, and ensures interoperability with A... |
| ["Ion Stoica"] | ["Apache Spark","Engineering Blog"] | {"createdOn":"2014-03-03","publishedOn":"2014-03-03","tz":"UTC"} | <div class="blogContent"> We are delighted with the recent <a href="https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces50">announcement</a> of the Apache Software Foundation that <a href="http://spark.apache.org">Apache Spark</a> has become a top-level Apache project. This is a recognition of the fantastic work done by the Spark open source community, which now counts over 140 developers from 30+ companies. In short time, Spark has become an increasingly popular solution for numerous big data applications, including machine learning, interactive queries, and stream processing. Spark now is an integral part of the Hadoop ecosystem, with many organizations employing Spark to perform sophisticated processing on their Hadoop data. At Databricks we are looking forward to continuing our work with the open source community to accelerate the development and adoption of Apache Spark. Currently employing the lead developers and creators of many of the components... |
| ["Ahir Reddy","Reynold Xin"] | ["Apache Spark","Engineering Blog"] | {"createdOn":"2014-02-13","publishedOn":"2014-02-13","tz":"UTC"} | The AMPLab at UC Berkeley, with help from Databricks, recently released an update to the <a href="https://amplab.cs.berkeley.edu/benchmark/">Big Data Benchmark</a>. This benchmark uses Amazon EC2 to compare performance of five popular SQL query engines in the Big Data ecosystem on common types of queries, which can be reproduced through publicly available scripts and datasets. In the past year, the community has invested heavily in performance optimizations of query engines. We are glad to see that all projects have evolved in this area. Although the queries used in the benchmark are simple, we are proud that Shark remains one of the fastest engines for these workloads, and has improved significantly since the last run. While this benchmark reaffirms Shark as a highly performant SQL query engine, we are working hard at Databricks to push the boundaries further. Stay tuned for some exciting news we will share soon with the community. <ul> <li><a href="https://amplab.cs.berkeley.edu/b... |
| ["Pat McDonough"] | ["Company Blog","Events"] | {"createdOn":"2014-02-11","publishedOn":"2014-02-11","tz":"UTC"} | The Databricks team is excited to take part in a number of activities throughout the 2014 O’Reilly Strata Conference in Santa Clara. From hands-on training, to office hours, to several talks (including a keynote), there are plenty of chances for attendees to learn how Apache Spark is bringing ease of use and outstanding performance to your big data. The schedule for the Databricks team includes: <ul> <li><a href="http://ampcamp.berkeley.edu/4/">AMPCamp4</a>, Hosted at Strata</li> <li><a href="http://strataconf.com/strata2014/public/content/office-hours">Office Hours</a> on Wednesday at 5:45pm</li> <li><a href="http://strataconf.com/strata2014/public/schedule/detail/33057">How Companies are Using Spark, and Where the Edge in Big Data Will Be</a>, a keynote talk presented by Matei Zaharia on Thursday at 9:15am</li> <li><a href="http://strataconf.com/strata2014/public/schedule/detail/32375">Querying Petabytes of Data in Seconds with BlinkDB</a>, co-presented by Reynold Xin on Thur... |
| ["Ion Stoica"] | ["Apache Spark","Ecosystem","Engineering Blog"] | {"createdOn":"2014-01-22","publishedOn":"2014-01-22","tz":"UTC"} | We are often asked how does <a href="http://spark.incubator.apache.org">Apache Spark</a> fits in the Hadoop ecosystem, and how one can run Spark in a existing Hadoop cluster. This blog aims to answer these questions. First, Spark is intended to <em>enhance</em>, not replace, the Hadoop stack. From day one, Spark was designed to read and write data from and to HDFS, as well as other storage systems, such as HBase and Amazon’s S3. As such, Hadoop users can enrich their processing capabilities by combining Spark with Hadoop MapReduce, HBase, and other big data frameworks. Second, we have constantly focused on making it as easy as possible for <em>every Hadoop user</em> to take advantage of Spark’s capabilities. No matter whether you run Hadoop 1.x or Hadoop 2.0 (YARN), and no matter whether you have administrative privileges to configure the Hadoop cluster or not, there is a way for you to run Spark! In particular, there are three ways to deploy Spark in a Hadoop cluster: standalone, YA... |
| ["Patrick Wendell"] | ["Apache Spark","Engineering Blog"] | {"createdOn":"2013-12-20","publishedOn":"2013-12-20","tz":"UTC"} | We are happy to announce the release of Apache Spark 0.8.1. In addition to performance and stability improvements, this release adds three new features. First, Spark now supports for the newest versions of YARN (2.2+). Second, the standalone cluster manager supports a high-availability mode in which it can tolerate master failures. Third, shuffles have been optimized to create fewer files, improving shuffle performance drastically in some settings. In conjunction with the Apache Spark 0.8.1 release we are separately releasing <a href="https://databricks.com/blog/2014/01/01/simr.html">Spark In MapReduce (SIMR)</a>, which enables seamlessly running Spark on Hadoop MapReduce v1 clusters without requiring the installation of Scala or Spark. While Apache Spark 0.8.1 is a minor release, it includes these larger features for the benefit of Scala 2.9 users. The next major release of Apache Spark, 0.9.0, will be based on Scala 2.10. This release was a community effort, featuring contribution... |
| ["Andy Konwinski"] | ["Company Blog","Customers","Events"] | {"createdOn":"2013-12-19","publishedOn":"2013-12-19","tz":"UTC"} | Earlier this month we held the <a href="http://spark-summit.org/2013">first Spark Summit</a>, a conference to bring the Apache Spark community together. We are excited to share some statistics and highlights from the event. <ul> <li>450 participants from over 180 companies attended</li> <li>Participants came from 13 countries</li> <li>Spark training was sold out at 200 participants from 80 companies</li> <li>20 organizations sponsored the event, including all major Hadoop platform vendors</li> <li>20 different organizations gave talks</li> </ul> Videos and slides for all talks are now available on the <a href="http://spark-summit.org/2013">Summit 2013 page</a>. The Summit included Keynotes from Databricks, the UC Berkeley AMPLab, and Yahoo, as well as presentations from 18 other companies including Amazon, Red Hat, and Adobe. Talk topics covered a wide range including specialized applications such as mapping and manipulating the brain, product launches, and research projects... |
| ["Pat McDonough"] | ["Apache Spark","Engineering Blog"] | {"createdOn":"2013-11-22","publishedOn":"2013-11-22","tz":"UTC"} | [sidenote]A version of this post appears on the <a href="http://blog.cloudera.com/blog/2013/11/putting-spark-to-use-fast-in-memory-computing-for-your-big-data-applications/">Cloudera Blog</a>.[/sidenote] <hr/> Apache Hadoop has revolutionized big data processing, enabling users to store and process huge amounts of data at very low costs. MapReduce has proven to be an ideal platform to implement complex batch applications as diverse as sifting through system logs, running ETL, computing web indexes, and powering personal recommendation systems. However, its reliance on persistent storage to provide fault tolerance and its one-pass computation model make MapReduce a poor fit for low-latency applications and iterative computations, such as machine learning and graph algorithms. Apache Spark addresses these limitations by generalizing the MapReduce computation model, while dramatically improving performance and ease of use. <h2 id="fast-and-easy-big-data-processing-with-spark">Fast and ... |
| ["Ion Stoica"] | ["Company Blog","Partners"] | {"createdOn":"2013-10-29","publishedOn":"2013-10-29","tz":"UTC"} | Today, Cloudera announced that it will distribute and support Apache Spark. We are very excited about this announcement, and what it brings to the Spark platform and the open source community. So what does this announcement mean for Spark? First, it validates the maturity of the Spark platform. Started as a research project at UC Berkeley in 2009, Spark is the first general purpose cluster computing engine that can run sophisticated computations at memory speeds on Hadoop clusters. Spark started with the goal of providing efficient support for iterative algorithms (such as machine learning) and interactive queries, workloads not well supported by MapReduce. Since then, Spark has grown to support other applications such as streaming, and has gained rapid industry adoption. Today, Spark is used in production by numerous companies, and it counts on an ever growing open source community with over 90 contributors from 25 companies. Second, it will make the Spark platform available to a wi... |
| ["Matei Zaharia"] | ["Announcements","Company Blog"] | {"createdOn":"2013-10-28","publishedOn":"2013-10-28","tz":"UTC"} | This year has seen unprecedented growth in both the user and contributor communities around <a href="http://spark.incubator.apache.org">Apache Spark</a>. This rapid growth validates the tremendous potential of the platform, and shows the great excitement around it. While Spark started as a research project by a few grad students at UC Berkeley in 2009, today <strong>over 90 developers from 25 companies have contributed to Spark</strong>. This is not counting contributors to Shark (Hive on Spark), of which there are 25. Indeed, out of the many new big data engines created in the past few years, <strong>Spark has the largest development community after Hadoop MapReduce</strong>. We believe that new components in the project, like <a href="http://spark.incubator.apache.org/docs/latest/streaming-programming-guide.html">Spark Streaming</a> and <a href="http://spark.incubator.apache.org/docs/latest/mllib-guide.html">MLlib</a>, will only increase this growth. <h2>Growth by Numbers</h2> To gi... |
| ["Ion Stoica","Matei Zaharia"] | ["Announcements","Company Blog"] | {"createdOn":"2013-10-27","publishedOn":"2013-10-27","tz":"UTC"} | When we announced that the original team behind <a href="http://spark.incubator.apache.org">Apache Spark</a> is starting a company around the project, we got a lot of excited questions. What areas will the company focus on, and what will it mean for the open source project? Today, in our first blog post at Databricks, we’re happy to share some of our goals, and say a little about what we’re doing next with Spark. To start with, our mission at Databricks is simple: we want to build the very best computing platform for extracting value from data. Big data is a tremendous opportunity that is still largely untapped, and we’ve been working for the past six years to transform what can be done with it. Going forward, we are fully committed to building out the open source Apache Spark platform to achieve this goal. <h2 id="how-we-think-about-big-data-speed-and-sophistication">How We Think about Big Data: Speed and Sophistication</h2> In the past few years, open source technologies like Hadoop... |
| ["Arsalan Tavakoli-Shiraji"] | ["Company Blog","Partners"] | {"createdOn":"2014-04-11","publishedOn":"2014-04-11","tz":"UTC"} | Today, MapR announced that it will distribute and support the Apache Spark platform as part of the MapR Distribution for Hadoop in partnership with Databricks. We’re thrilled to start on this journey with MapR for a multitude of reasons. One of our primary goals at Databricks is to drive broad adoption of Spark and ensure everybody who uses it has a fantastic experience. This partnership will enable all of MapR’s enterprise customers, existing and new, to leverage Spark with the backing of the same great enterprise support available for the rest of MapR’s Hadoop Distribution. As Tomer mentioned in his <a href="/blog/2014/04/10/MapR-Integrates-Spark-Stack.html">blog post</a>, Spark is one of the most common topics in discussions with MapR’s existing customers and many are even already running it in production! A core part of Spark’s value proposition is the ability to easily build a unified end-to-end workflow where critical functions are first class citizens that are seamlessly integ... |
| ["Prashant Sharma","Matei Zaharia"] | ["Apache Spark","Engineering Blog"] | {"createdOn":"2014-04-15","publishedOn":"2014-04-15","tz":"UTC"} | One of Apache Spark’s main goals is to make big data applications easier to write. Spark has always had concise APIs in Scala and Python, but its Java API was verbose due to the lack of function expressions. With the addition of <a href="http://docs.oracle.com/javase/tutorial/java/javaOO/lambdaexpressions.html">lambda expressions</a> in Java 8, we’ve updated Spark’s API to transparently support these expressions, while staying compatible with old versions of Java. This new support will be available in Apache Spark 1.0. <h2 id="a-few-examples">A Few Examples</h2> The following examples show how Java 8 makes code more concise. In our first example, we search a log file for lines that contain “error”, using Spark’s <code>filter</code> and <code>count</code> operations. The code is simple to write, but passing a Function object to <code>filter</code> is clunky: <h5 id="java-7-search-example">Java 7 search example:</h5> <pre>JavaRDD<String> lines = sc.textFile("hdfs://log.txt").filter( n... |
| ["Databricks Training Team"] | ["Announcements","Company Blog","Events"] | {"createdOn":"2014-06-02","publishedOn":"2014-06-02","tz":"UTC"} | Databricks is excited to launch its training program, starting with <a title="Spark Training" href="https://databricks.com/training">a series of hands-on Apache Spark workshops</a> designed by the creators of Apache Spark. The first workshop, <em>Introduction to Apache Spark</em>, establishes the fundamentals of using Spark for data exploration, analysis, and building big data applications. This one day workshop is hands-on, covering topics such as: interactively working with Spark's core APIs, learning the key concepts of big data, deploying applications on common Hadoop distributions, and unifying data pipelines with SQL, Streaming, and Machine Learning. Workshops are currently scheduled in New York, San Jose, Austin, and Chicago, with workshops in more cities to come. Visit <a title="Databricks Training" href="https://databricks.com/training">Databricks' training page</a> to find more information and please leave feedback there if you'd like to see a workshop in your area. <ul cla... |
| ["Claudiu Barbura (Sr. Dir. of Engineering at Atigeo LLC)"] | ["Company Blog","Partners"] | {"createdOn":"2014-05-23","publishedOn":"2014-05-23","tz":"UTC"} | <div class="post-meta">This post is guest authored by our friends at <a href="http://atigeo.com">Atigeo</a> announcing the certification of their xPatterns offering.</div> <hr /> Here at <a href="http://atigeo.com/">Atigeo</a>, we are always looking for ways to build on, improve, and expand our big data analytics platform, Atigeo xPatterns. More than that, both our development and product management team are focused on big data and on knowing what is right for our customers: data scientists and application developers at companies who are seeking to make the best possible use of their data assets. So we all stay on the lookout for the most useful, advanced, and best-performing set of technologies available. Apache Spark, for us, was a standout: We could see that making a dramatic performance improvement available to our customers and users would mean that xPattern’s analytics, modeling, and machine learning would be more responsive, and that Spark in xPatterns would give our customer... |
| ["Sarabjeet Chugh (Head of Hadoop Product Management at Pivotal Inc.)"] | ["Company Blog","Partners"] | {"createdOn":"2014-05-23","publishedOn":"2014-05-23","tz":"UTC"} | <div class="post-meta">This post is guest authored by our friends at <a href="http://www.gopivotal.com" target="_blank">Pivotal</a> describing why they’re excited to deliver Apache Spark on their world class Pivotal HD big data analytics platform suite.</div> <hr /> Today, we are excited to announce the immediate availability of the full Apache Spark stack on Pivotal HD. We have been impressed with the rapid adoption of Spark as a replacement for Hadoop’s more traditional processing engines as well as its vibrant ecosystem, and are thrilled to make it possible for Pivotal customers to run Apache Spark on Pivotal HD Hadoop. Just as important is how we’re doing it: Pivotal HD will be part of Databricks’ upcoming certification program – meaning a commitment to provide compatibility with Apache Spark and support the growing ecosystem of Spark applications. <h2>PivotalHD and Spark</h2> Unlike a multi-vendor patchwork of heterogeneous solutions, Pivotal brings together an integrated ful... |
| ["Patrick Wendell"] | ["Apache Spark","Engineering Blog"] | {"createdOn":"2014-05-30","publishedOn":"2014-05-30","tz":"UTC"} | Today, we’re very proud to announce the release of <a title="Spark 1.0.0 Release Notes" href="http://spark.apache.org/releases/spark-release-1-0-0.html">Apache Spark 1.0</a>. Apache Spark 1.0 is a major milestone for the Spark project that brings both numerous new features and strong API compatibility guarantees. The release is also a huge milestone for the Spark developer community: with more than 110 contributors over the past 4 months, it is Spark’s largest release yet, continuing a trend that has quickly made Spark the most active project in the Hadoop ecosystem. <h2>New Features</h2> What features are we most excited about in Apache Spark 1.0? While there are dozens of new features in the release, we’d like to highlight three. <b>Spark SQL</b> The biggest single addition to Apache Spark 1.0 is Spark SQL, a new module that <a title="Spark SQL" href="https://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html">we’ve previously blogged about</a>... |
| ["Michael Armbrust","Zongheng Yang"] | ["Apache Spark","Engineering Blog"] | {"createdOn":"2014-06-02","publishedOn":"2014-06-02","tz":"UTC"} | With <a title="Announcing Spark 1.0" href="https://databricks.com/blog/2014/05/30/announcing-spark-1-0.html">Apache Spark 1.0</a> out the door, we’d like to give a preview of the next major initiatives in the Spark project. Today, the most active component of Spark is <a title="Spark SQL: Manipulating Structured Data Using Spark" href="https://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html">Spark SQL</a> - a tightly integrated relational engine that inter-operates with the core Spark API. Spark SQL was released in Spark 1.0, and will provide a lighter weight, agile execution backend for future versions of Shark. In this post, we’d like to highlight some of the ways in which tight integration into Scala and Spark provide us powerful tools to optimize query execution with Spark SQL. This post outlines one of the most exciting features, dynamic code generation, and explains what type of performance boost this feature can offer using queries from a... |
| ["Michael Hiskey (VP at MicroStrategy Inc.)"] | ["Company Blog","Partners"] | {"createdOn":"2014-06-04","publishedOn":"2014-06-04","tz":"UTC"} | <div class="post-meta">This post is guest authored by our friends at <a href="http://www.microstrategy.com" target="_blank">MicroStrategy</a> describing why they're excited to have their platform "Certified on Apache Spark".</div> <hr /> <h2>The Need for Speed</h2> Over the past few years, we have seen Hadoop emerge as an effective foundation for many organizations’ big data management frameworks, but as the volume and varieties of data increase, speed continues to be a challenge. More and more of our customers are embracing Big Data, and the value of their investment is dependent on (and limited by) how quickly they can take data to action. We’ve been listening to our clients to understand how we can innovate to stay ahead of the curve to help solve these challenges. Apache Spark grabbed our attention because it addresses many of the limitations of Hadoop’s traditional functionality. Plus, Spark is simply impossible to ignore. The active, growing community of developers and enterpri... |
| ["Christopher Nguyen (CEO & Co-Founder of Adatao)"] | ["Company Blog","Partners"] | {"createdOn":"2014-06-11","publishedOn":"2014-06-11","tz":"UTC"} | <div class="post-meta">This post is guest authored by our friends at <a href="http://www.arimo.com" target="_blank">Arimo</a> describing why and how they bet on Apache Spark.</div> <hr /> In early 2012, a group of engineers with background in distributed systems and machine learning came together to form Arimo. We saw a major unsolved problem in the nascent Hadoop ecosystem: it was largely a storage play. Data was sitting passively on HDFS, with very little value being extracted. To be sure, there was MapReduce, Hive, Pig, etc., but value is a strong function of (a) speed of computation, (b) sophistication of logic, and (c) ease of use. While Hadoop ecosystem was being developed well at the substrate, there was enormous opportunities above it left uncaptured. <strong>On speed:</strong> we had seen data move at-scale and at enormously faster rates in systems like Dremel and PowerDrill at Google. It enabled interactive behavior simply not available to Hadoop users. Without doubt, we k... |
| ["Databricks Press Office"] | ["Company Blog","Events"] | {"createdOn":"2014-06-12","publishedOn":"2014-06-12","tz":"UTC"} | <ul> <li>Three-Day Event in San Francisco Invites Attendees to Gain Insights from the Leading Organizations in Big Data</li> <li>Keynote Speakers Include Executives from Databricks, Cloudera, MapR, DataStax, Jawbone and More</li> <li>Spark Summit Features Different Tracks for Applications, Development, Data Science and Research</li> </ul> BERKELEY, Calif.--(BUSINESS WIRE)-- Databricks and the sponsors of Spark Summit 2014 today announced the full agenda for the summit, including a host of exciting keynotes and community talks. The event will be held June 30–July 2, 2014, at The Westin St. Francis in San Francisco. Spark Summit 2014 arrives at an exciting time for the Apache Spark platform, which has become the most active open source project in the Hadoop ecosystem with more than 200 contributors in the past year. Now available in all major Hadoop distributions, Spark has fostered a fast-growing community on the strength of its technical capabilities, which make big data... |
| ["Dean Wampler (Typesafe)"] | ["Company Blog","Partners"] | {"createdOn":"2014-06-13","publishedOn":"2014-06-13","tz":"UTC"} | <div class="post-meta">This post is guest authored by our friends at <a href="http://www.lightbend.com" target="_blank">Lightbend</a> after having their Lightbend Activator Apache Spark templates be "Certified on Apache Spark".</div> <hr /> <h2>Apache Spark and the Lightbend Reactive Platform: A Match Made in Heaven</h2> When I started working with Hadoop several years ago, it was frustrating to find that writing Hadoop jobs was hard to do. If your problem fits a query model, then <a title="Hive" href="http://hive.apache.org" target="_blank">Hive</a> provides a SQL-based scripting tool. For many common dataflow problems, <a href="http://pig.apache.org" target="_blank">Pig</a> provides useful abstractions, but it isn't a full-fledged, "Turing-complete" language. Otherwise, you had to use the low-level <a href="http://wiki.apache.org/hadoop/MapReduce" target="_blank">Hadoop MapReduce</a> API. Some third-party APIs exist that wrap the MapReduce API, such as <a href="http://cascading.org... |
| ["Hari Kodakalla (EVP at Apervi Inc.)"] | ["Company Blog","Partners"] | {"createdOn":"2014-06-23","publishedOn":"2014-06-23","tz":"UTC"} | <div class="post-meta">This post is guest authored by our friends at <a href="http://www.apervi.com" target="_blank">Apervi</a> after having their Conflux Director™ application be "Certified on Apache Spark".</div> <hr /> <h2>Big Data on Steroids with Apache Spark</h2> As big data takes center stage in the new data explosion, Hadoop has emerged as one the leading technologies addressing the challenges in the space. As the data processing needs of enterprises are growing newer technologies like Apache Spark have emerged as significant options that consistently offer expanded capabilities for the big data space. As these enterprise needs are met, so is the increased appetite for faster processing, low latency requirements for high velocity data and an iterative demand for processing where leading technologies like Hadoop fall short of expectations or at times seem cumbersome to implement due to its inherent design. Delivering on this growing need of enterprises is where Spark plays a ... |
| ["Bill Kehoe (Big Data Architect at Qlik)"] | ["Company Blog","Partners"] | {"createdOn":"2014-06-24","publishedOn":"2014-06-24","tz":"UTC"} | <div class="post-meta">This post is guest authored by our friends at <a href="http://www.qlik.com" target="_blank">Qlik</a> describing how Apache Spark enables the full power of QlikView, recently Certified on Apache Spark, and its Associative Experience feature over the entire HDFS data set.</div> <hr /> <h2>The Power of Qlik</h2> Qlik provides software and services that help make understanding data a natural part of how people make decisions. Our product, QlikView, is the leading Business Discovery platform that incorporates a unique, associative experience that empowers business users to follow their own path to formulate and answer questions that lead to better decisions. Traditional, query-based BI tools force users thru pre-defined navigation paths which limit the kinds of questions that can be answered and require costly and time consuming revisions to address evolving business needs. In contrast, when a user selects data items using QlikView, all the fields and charts are imm... |
| ["Databricks Press Office"] | ["Announcements","Company Blog"] | {"createdOn":"2014-06-26","publishedOn":"2014-06-26","tz":"UTC"} | <em>Certified distributions maintain compatibility with open source Apache Spark distribution and thus support the growing ecosystem of Apache Spark applications</em> <hr /> <strong>BERKELEY, Calif. -- June 26, 2014 --</strong> Databricks, the company founded by the creators of Apache Spark, the next generation Big Data engine, today announced the <a href="https://databricks.com/spark/certification/certified-spark-distribution" target="_blank">“Certified Spark Distribution” </a>program for vendors with a commercial Spark distribution. Certification indicates that the vendor’s Spark distribution is compatible with the open source Apache Spark distribution, enabling “Certified on Spark” applications - certified to work with Apache Spark - to run on the vendor’s Spark distribution out-of-the-box. “One of Databricks’ goals is to ensure users have a fantastic experience. Our belief is that having the community work together to maintain compatibility and therefore facilitate a vibrant app... |
| ["Costin Leau (Engineer at Elasticsearch)"] | ["Company Blog","Partners"] | {"createdOn":"2014-06-28","publishedOn":"2014-06-28","tz":"UTC"} | <div class="post-meta">This post is guest authored by our friends at <a href="http://www.elasticsearch.com" target="_blank">Elasticsearch</a> announcing Elasticsearch is now "Certified on Apache Spark", the first step in a collaboration to provide tighter integration between Elasticsearch and Spark.</div> <hr /> <h2>Elasticsearch Now “Certified on Spark”</h2> Helping businesses get insights out of their data, fast, is core to the mission of Elasticsearch. Being able to live wherever a business stores their data is obviously critical to that mission, and Hadoop is one of the leaders in providing a way for businesses to store massive amounts of data at scale. Over the course of the past year, we have been working hard to bring the power of our real-time search and analytics engine to the Hadoop ecosystem. Our Hadoop connector, Elasticsearch for Apache Hadoop, is compatible with the top three Hadoop distributions – Cloudera, Hortonworks and MapR – and today has achieved another exciting... |
| ["Jake Cornelius (SVP of Product Management at Pentaho)"] | ["Company Blog","Partners"] | {"createdOn":"2014-06-30","publishedOn":"2014-06-30","tz":"UTC"} | [sidenote]This post is guest authored by our friends at <a href="http://www.pentaho.com" target="_blank">Pentaho</a> after having their data integration and analytics platform <a href="http://www.databricks.com/certification" target="_blank">“Certified on Apache Spark.”</a>[/sidenote] <hr /> One of Pentaho’s great passions is to empower organizations to take advantage of amazing innovations in <a href="http://www.pentaho.com/what-is-big-data" target="_blank">Big Data</a> to solve new challenges using the existing skill sets they have in their organizations today. Our Pentaho Labs prototyping and innovation efforts around natively integrating data engineering and analytics with Big Data platforms like <a href="http://www.pentaho.com/what-is-hadoop" target="_blank">Hadoop</a> and <a href="http://www.pentaho.com/storm" target="_blank">Storm</a> have already led dozens of customers to deploy next-generation Big Data solutions. Examples of these solutions include <a href="http://www.pent... |
| ["SriSatish Ambati (CEO of 0xData)"] | ["Company Blog","Partners"] | {"createdOn":"2014-06-30","publishedOn":"2014-06-30","tz":"UTC"} | <div class="post-meta">This post is guest authored by our friends at <a href="http://www.0xdata.com" target="_blank">0xData</a> discussing the release of Sparkling Water - the integration of their H20 offering with the Apache Spark platform.</div> <hr /> <h3>H20 – The Killer-App on Apache Spark</h3> <img class="aligncenter size-full wp-image-62" src="https://databricks.com/wp-content/uploads/2014/06/Spark-+-H20.png" width="472" /> In-memory big data has come of age. The Apache Spark platform, with its elegant API, provides a unified platform for building data pipelines. H2O has focused on scalable machine learning as the API for big data applications. Spark + H2O combines the capabilities of H2O with the Spark platform – converging the aspirations of data science and developer communities. H2O is the Killer-Application for Spark. <img class="aligncenter size-full wp-image-62" src="https://databricks.com/wp-content/uploads/2014/06/H20-the-Killer-App.png" width="472" /> <h3>Backdrop<... |
| ["Databricks Press Office"] | ["Announcements","Company Blog"] | {"createdOn":"2014-06-30","publishedOn":"2014-06-30","tz":"UTC"} | <ul> <li>Databricks Cloud Allows Users to Get Value from Apache Spark without the Challenges Normally Associated with Big Data Infrastructure</li> <li>Ease-of-Use of Turnkey Solution Brings the Power of Spark to a Wider Audience and Fuels the Growth of the Spark Ecosystem</li> <li>Funding Led by NEA with Follow-on Investment from Andreessen Horowitz</li> </ul> <strong>Berkeley, Calif. (June 30, 2014)</strong>—Databricks, the company founded by the creators of Apache Spark—the powerful open-source processing engine that provides blazingly fast and sophisticated analytics—announced today the launch of <a title="Databricks Cloud" href="https://databricks.com/cloud">Databricks Cloud</a>, a cloud platform built around Apache Spark. In addition to this launch, the company is announcing the close of $33 million in series B funding led by New Enterprise Associates (NEA) with follow-on investment from Andreessen Horowitz. “Getting the full value out of their Big Data investments is still... |
| ["Arsalan Tavakoli-Shiraji"] | ["Company Blog","Events"] | {"createdOn":"2014-04-29","publishedOn":"2014-04-29","tz":"UTC"} | At Databricks, we’ve been thrilled to see the rapid pace of adoption of Apache Spark, as it has been embraced by an increasing number of enterprise vendors and has grown to be the most active open source project in the Hadoop ecosystem. We also know that a critical piece of enabling enterprises to unlock its potential is a strong ecosystem of applications built on top of or integrated with Spark. We launched the <a href="http://www.databricks.com/certification/">“Certified on Apache Spark”</a> program to support these application developer efforts, and have been blown away at the diverse set of applications being built on top of Spark, and want this great work to be exposed to the broader community. In that light, this year’s Spark Summit will have an “Application Spotlight” segment that will highlight some of the best we’ve seen. Read on for details on how to apply and what selection entails. All applications eligible (even if not yet certified) for the Databricks “Certified on Spar... |
| ["Arsalan Tavakoli-Shiraji"] | ["Company Blog","Partners"] | {"createdOn":"2014-05-08","publishedOn":"2014-05-08","tz":"UTC"} | <p>Today, Datastax and Databricks announced a partnership in which Apache Spark becomes an integral part of the Datastax offering, tightly integrated with Cassandra. We’re very excited to be embarking on this journey with Datastax for a multitude of reasons:</p> <h2 id="integrating-operational-systems-with-analytics">Integrating operational systems with analytics</h2> <p>One of the use cases that we’ve increasingly been asked about by Spark users is the ability to create a closed loop system: perform advanced analytics directly on operational data that is then fed back into the operational system to drive necessary adaptation. The tight integration of Cassandra and Spark will enable users to achieve this goal by leveraging Cassandra as the high-performance transactional database that powers online applications and Spark as a next generation processing engine that can deliver deeper insights, faster while seamlessly moving between the two.</p> <h2 id="spark-beyond-hadoop">Spark beyond... |
| ["Databricks Press Office"] | ["Announcements","Company Blog"] | {"createdOn":"2014-04-30","publishedOn":"2014-04-30","tz":"UTC"} | <p><strong>VANCOUVER, BC. – April 30, 2014 –</strong> Simba Technologies Inc., the industry’s expert for Big Data connectivity, announced today that Databricks has licensed Simba’s ODBC Driver as its standards-based connectivity solution for Shark, the SQL front-end for Apache Spark, the next generation Big Data processing engine. Founded by the creators of Apache Spark and Shark, Databricks is developing cutting-edge systems to enable enterprises to discover deeper insights, faster.</p> <p>“We believe that Big Data is a tremendous opportunity that is still largely untapped, and we are working to revolutionize what organizations can do with it,” says Ion Stoica, Chief Executive Officer at Databricks, and Professor of Computer Science at UC Berkeley. “As part of this mission, we understand that BI tools will continue to be a key medium for consuming data and analytics and are excited to announce the availability of an enterprise-grade connectivity option for users of BI tools. ... |
| ["Databricks Press Office"] | ["Announcements","Company Blog","Partners"] | {"createdOn":"2014-07-01","publishedOn":"2014-07-01","tz":"UTC"} | <strong>SAN FRANCISCO — July 1, 2014</strong> — Databricks, the company founded by the creators of Apache Spark – the popular open-source processing engine - today announced a new partnership with <a href="http://www.sap.com" target="_blank">SAP (NYSE: SAP)</a> and to deliver a Databricks-certified Apache Spark distribution offering for the SAP HANA® platform. The full production-ready distribution offering, based on Apache Spark 1.0, is deployable in the cloud or on premise and available for immediate download from SAP at no cost at <a href="http://spr.ly/SAP_and_Spark" target="_blank">spr.ly/SAP_and_Spark</a>. The announcement was made at the Spark Summit 2014, being held June 30 – July 2 in San Francisco. The Databricks-certified distribution offering for SAP HANA contains the Spark processing engine that works with any Hadoop distribution out of the box, providing a more complete data store and processing layer for Hadoop. Certified by Databricks to be compatible with the Apache ... |
| ["Arsalan Tavakoli-Shiraji"] | ["Company Blog","Partners"] | {"createdOn":"2014-07-01","publishedOn":"2014-07-01","tz":"UTC"} | This morning SAP released its own “Certified Spark Distribution” as part of a brand new partnership announced between Databricks and SAP. We’re thrilled to be embarking on this journey with them, not just because of what it means for Databricks as a company, but just as importantly because of what it means for Apache Spark and the Spark community. <h2>Access to the full corpus of data</h2> Fundamentally, every enterprise's big data vision is to convert data into value; a core ingredient in this quest is the availability of the data that needs to be mined for insights. Although the growth in volume of data sitting in HDFS has been incredible and continues to grow exponentially, much of this has been contextual data - e.g., social data, click-stream data, sensor data, logs, 3rd party data sources - and historical data. Real-time operational data - e.g., data from foundational enterprise applications such as ERP (Enterprise Resource Planning), CRM (Customer Relationship Management), and S... |
| ["Reynold Xin"] | ["Apache Spark","Engineering Blog"] | {"createdOn":"2014-07-02","publishedOn":"2014-07-02","tz":"UTC"} | With the introduction of Spark SQL and the new Hive on Apache Spark effort (<a href="https://issues.apache.org/jira/browse/HIVE-7292">HIVE-7292</a>), we get asked a lot about our position in these two projects and how they relate to Shark. At the <a href="http://spark-summit.org/2014">Spark Summit</a> today, we announced that we are ending development of Shark and will focus our resources towards Spark SQL, which will provide a superset of Shark’s features for existing Shark users to move forward. In particular, Spark SQL will provide both a seamless upgrade path from Shark 0.9 server and new features such as integration with general Spark programs. <img class="alignnone wp-image-818 size-large" src="https://databricks.com/wp-content/uploads/2014/07/sql-directions-1024x691.png" alt="Future of SQL on Spark" width="400" /> <h2>Shark</h2> When the Shark project started 3 years ago, Hive (on MapReduce) was the only choice for SQL on Hadoop. Hive compiled SQL into scalable MapReduce jobs a... |
| ["Ion Stoica"] | ["Company Blog","Product"] | {"createdOn":"2014-07-14","publishedOn":"2014-07-14","tz":"UTC"} | Our vision at Databricks is to <strong>make big data easy</strong> so that we enable <strong>every</strong> organization to turn its data into value. At Spark Summit 2014, we were very excited to unveil <a href="https://databricks.com/cloud" target="_blank">Databricks</a>, our first product towards fulfilling this vision. In this post, I’ll briefly go over the challenges that data scientists and data engineers face today when working with big data, and then show how Databricks addresses these challenges. <h2>Today’s Big Data Challenges</h2> While the promise of big data to <a href="http://spark-summit.org/2014/talk/using-spark-to-generate-analytics-for-international-cable-tv-video-distribution" target="_blank">improve businesses</a>, <a href="http://spark-summit.org/2014/talk/david-patterson" target="_blank">save lives</a>, and <a href="http://spark-summit.org/2014/talk/A-platform-for-large-scale-neuroscience" target="_blank">advance science</a> is becoming more and more real, analyzi... |
| ["Xiangrui Meng"] | ["Apache Spark","Engineering Blog","Machine Learning"] | {"createdOn":"2014-07-16","publishedOn":"2014-07-16","tz":"UTC"} | MLlib is an Apache Spark component focusing on machine learning. It became a standard component of Spark in version 0.8 (Sep 2013). The initial contribution was from Berkeley AMPLab. Since then, 50+ developers from the open source community have contributed to its codebase. With the release of Apache Spark 1.0, I’m glad to share some of the new features in MLlib. Among the most important ones are: <ul> <li>sparse data support</li> <li>regression and classification trees</li> <li>distributed matrices</li> <li>PCA and SVD</li> <li>L-BFGS optimization algorithm</li> <li>new user guide and code examples</li> </ul> This is the first in a series of blog posts about features and optimizations in MLlib. We will focus on one feature new in 1.0 — sparse data support. <h2>Large-scale ≈ Sparse</h2> When I was in graduate school, I wrote “large-scale sparse least squares” in a paper draft. My advisor crossed out the word “sparse” and left a comment: “Large-scale already implies sparsity... |
| ["Matei Zaharia"] | ["Apache Spark","Engineering Blog"] | {"createdOn":"2014-07-19","publishedOn":"2014-07-19","tz":"UTC"} | <div class="post-meta">This post originally appeared in <a href="http://inside-bigdata.com/2014/07/15/theres-spark-theres-fire-state-apache-spark-2014/" target="_blank">insideBIGDATA</a> and is reposted here with permission.</div> <hr /> With the second <a href="http://spark-summit.org/2014">Spark Summit</a> behind us, we wanted to take a look back at our journey since 2009 when Apache Spark, the fast and general engine for large-scale data processing, was initially developed. It has been exciting and extremely gratifying to watch Spark mature over the years, thanks in large part to the vibrant, open source community that latched onto it and busily began contributing to make Spark what it is today. The idea for Spark first emerged in the AMPLab (AMP stands for Algorithms, Machines, and People) at the University of California, Berkeley. With its significant industry funding and exposure, the AMPlab had a unique perspective on what is important and what issues exist among early adopte... |
| ["Burak Yavuz","Xiangrui Meng","Reynold Xin"] | ["Apache Spark","Engineering Blog","Machine Learning"] | {"createdOn":"2014-07-23","publishedOn":"2014-07-23","tz":"UTC"} | Recommendation systems are among the most popular applications of machine learning. The idea is to predict whether a customer would like a certain item: a product, a movie, or a song. Scale is a key concern for recommendation systems, since computational complexity increases with the size of a company's customer base. In this blog post, we discuss how Apache Spark MLlib enables building recommendation models from billions of records in just a few lines of Python (<a href="http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html">Scala/Java APIs also available</a>).<!--more--> [python] from pyspark.mllib.recommendation import ALS # load training and test data into (user, product, rating) tuples def parseRating(line): fields = line.split() return (int(fields[0]), int(fields[1]), float(fields[2])) training = sc.textFile("...").map(parseRating).cache() test = sc.textFile("...").map(parseRating) # train a recommendation model model = ALS.train(tra... |
| ["Li Pu","Reza Zadeh"] | ["Apache Spark","Engineering Blog","Machine Learning"] | {"createdOn":"2014-07-22","publishedOn":"2014-07-22","tz":"UTC"} | <div class="post-meta">Guest post by Li Pu from Twitter and Reza Zadeh from Databricks on their recent contribution to Apache Spark's machine learning library.</div> <hr /> The <a href="http://en.wikipedia.org/wiki/Singular_value_decomposition">Singular Value Decomposition (SVD)</a> is one of the cornerstones of linear algebra and has widespread application in many real-world modeling situations. Problems such as recommender systems, linear systems, least squares, and many others can be solved using the SVD. It is frequently used in statistics where it is related to principal component analysis (PCA) and to correspondence analysis, and in signal processing and pattern recognition. Another usage is latent semantic indexing in natural language processing. Decades ago, before the rise of distributed computing, computer scientists developed the single-core <a href="http://www.caam.rice.edu/software/ARPACK/">ARPACK package</a> for computing the eigenvalue decomposition of a matrix. Since... |
| ["Scott Walent"] | ["Company Blog","Events"] | {"createdOn":"2014-07-23","publishedOn":"2014-07-23","tz":"UTC"} | From June 30 to July 2, 2014 we held the <a href="http://spark-summit.org/2014">second Spark Summit</a>, a conference focused on promoting the adoption and growth of <a href="http://spark.apache.org">Apache Spark</a>. This was an exciting year for the Spark community and we are proud to share some highlights. <ul> <li>1,164 participants from over 453 companies attended</li> <li>Spark Training sold out at 300 participants</li> <li>31 organizations sponsored the event</li> <li>12 keynotes and 52 community presentations were given</li> </ul> Videos and slides from all presentations are now available on the <a href="http://spark-summit.org/2014/agenda">Summit 2014 agenda</a> page. Some highlights include: <ul> <li>Spark Summit <a href="https://www.youtube.com/watch?v=lO7LhVZrNwA&index=2&list=PL-x35fyliRwiST9gF7Z8Nu3LgJDFRuwfr">keynote from Databricks CEO Ion Stoica</a> introducing <a href="http://www.databricks.com/cloud">Databricks Cloud</a></li> <li>Open source comm... |
| ["Oscar Mendez (CEO of Stratio)"] | ["Company Blog","Partners"] | {"createdOn":"2014-08-08","publishedOn":"2014-08-08","tz":"UTC"} | <div class="post-meta">This is a guest post from our friends at <a href="http://www.stratio.com" target="_blank">Stratio</a> announcing that their platform is now a "Certified Apache Spark Distribution".</div> <hr /> <h2>Certified distribution</h2> Stratio is delighted to announce that it is officially a Certified Apache Spark Distribution. The certification is very important for us because we deeply believe that the certification program provides many benefits to the Spark community: It facilitates collaboration and integration, offers broad evolution and support for the rich Spark ecosystem, simplifies adoption of critical security updates and allows development of applications valid for any certified distribution - a key ingredient for a successful ecosystem. <!--more--> This post is a brief history of how we started with big data technologies until we made the shift to Spark. <h2>When Stratio met Spark: A true love story</h2> We started using Big Data technologies more than 7 yea... |
| ["Andy Huang (Alibaba Taobao Data Mining Team)","Wei Wu (Alibaba Taobao Data Mining Team)"] | ["Apache Spark","Engineering Blog","Machine Learning"] | {"createdOn":"2014-08-15","publishedOn":"2014-08-15","tz":"UTC"} | <div class="post-meta">This is a guest blog post from our friends at Alibaba Taobao.</div> <hr /> Alibaba Taobao operates one of the world’s largest e-commerce platforms. We collect hundreds of petabytes of data on this platform and use Apache Spark to analyze these enormous amounts of data. Alibaba Taobao probably runs some of the largest Spark jobs in the world. For example, some Spark jobs run for weeks to perform feature extraction on petabytes of image data. In this blog post, we share our experience with Spark and GraphX from prototype to production at the Alibaba Taobao Data Mining Team. <!--more--> Every day, hundreds of millions of users and merchants interact on Alibaba Taobao’s marketplace. These interactions can be expressed as complicated, large scale graphs. Mining data requires a distributed data processing engine that can support fast interactive queries as well as sophisticated algorithms. Spark and GraphX embed a standard set of graph mining algorithms, including ... |
| ["Doris Xin","Burak Yavuz","Xiangrui Meng","Hossein Falaki"] | ["Apache Spark","Engineering Blog","Machine Learning"] | {"createdOn":"2014-08-27","publishedOn":"2014-08-27","tz":"UTC"} | One of our philosophies in Apache Spark is to provide rich and friendly built-in libraries so that users can easily assemble data pipelines. With Spark, and MLlib in particular, quickly gaining traction among data scientists and machine learning practitioners, we’re observing a growing demand for data analysis support outside of model fitting. To address this need, we have started to add scalable implementations of common statistical functions to facilitate various components of a data pipeline. <!--more-->We’re pleased to announce Apache Spark 1.1. ships with built-in support for several statistical algorithms common in exploratory data pipelines: <ul> <li><strong>correlations</strong>: data dependence analysis</li> <li><strong>hypothesis testing</strong>: goodness of fit; independence test</li> <li><strong>stratified sampling</strong>: scaling training set with controlled label distribution</li> <li><strong>random data generation</strong>: randomized algorithms; performance t... |
| ["Patrick Wendell"] | ["Apache Spark","Engineering Blog","Streaming"] | {"createdOn":"2014-09-12","publishedOn":"2014-09-12","tz":"UTC"} | Today we’re thrilled to announce the release of Apache Spark 1.1! Apache Spark 1.1 introduces many new features along with scale and stability improvements. This post will introduce some key features of Apache Spark 1.1 and provide context on the priorities of Spark for this and the next release.<!--more--> In the next two weeks, we’ll be publishing blog posts with more details on feature additions in each of the major components. Apache Spark 1.1 is already available to Databricks customers and has also been posted today on the <a href="http://spark.apache.org/releases/spark-release-1-1-0.html">Apache Spark website</a>. <!--more--> <h2>Maturity of SparkSQL</h2> The 1.1 released upgrades Spark SQL significantly from the preview delivered in Apache Spark 1.0. At Databricks, we’ve migrated all of our customer workloads from Shark to Spark SQL, with between 2X and 5X <a href="https://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html">perfo... |
| ["Arsalan Tavakoli-Shiraji","Tathagata Das","Patrick Wendell"] | ["Apache Spark","Engineering Blog","Streaming"] | {"createdOn":"2014-09-16","publishedOn":"2014-09-16","tz":"UTC"} | With Apache Spark 1.1 recently released, we’d like to take this occasion to feature one of the most popular Spark components - Spark Streaming - and highlight who is using Spark Streaming and why. Apache Spark 1.1. adds several new features to Spark Streaming. In particular, Spark Streaming extends its library of ingestion sources to include Amazon Kinesis, a hosted stream processing engine, as well as to provide high availability for Apache Flume sources. Moreover, Apache Spark 1.1 adds the first of a set of online machine learning algorithms with the introduction of a streaming linear regression. Many organizations have evolved from exploratory, discovery use cases of big data to use cases that require reasoning on data as it arrives in order to make decisions in real time. Spark Streaming enables this category of high-value use cases, providing a system for processing fast and large streams of data in real time. <b>What is it?</b> Spark Streaming is an extension of the core S... |
| ["Burak Yavuz","Xiangrui Meng"] | ["Apache Spark","Engineering Blog","Machine Learning"] | {"createdOn":"2014-09-22","publishedOn":"2014-09-22","tz":"UTC"} | With an ever-growing community, Apache Spark has had it’s <a href="https://databricks.com/blog/2014/09/11/announcing-spark-1-1.html" target="_blank">1.1 release</a>. MLlib has had its fair share of contributions and now supports many new features. We are excited to share some of the performance improvements observed in MLlib since the 1.0 release, and discuss two key contributing factors: torrent broadcast and tree aggregation. <h2>Torrent broadcast</h2> The beauty of Spark as a unified framework is that any improvements made on the core engine come for free in its standard components like MLlib, Spark SQL, Streaming, and GraphX. In Apache Spark 1.1, we changed the default broadcast implementation of Spark from the traditional <code>HttpBroadcast</code> to <code>TorrentBroadcast</code>, a BitTorrent like protocol that evens out the load among the driver and the executors. When an object is broadcasted, the driver divides the serialized object into multiple chunks, and broadcasts the ch... |
| ["Gavin Targonski (Product Management at Talend)"] | ["Company Blog","Partners"] | {"createdOn":"2014-09-15","publishedOn":"2014-09-15","tz":"UTC"} | <div class="post-meta">This post is guest authored by our friends at <a href="http://www.talend.com" target="_blank">Talend</a> after having Talend Studio <a href="http://www.databricks.com/certification" target="_blank">“Certified on Apache Spark.”</a></div> <hr /> As the move to the next generation of integration platforms grows momentum, the need to implement a proven and scalable technology is critical. Databricks and Apache Spark, delivered on the major Hadoop distributions, is one such area where the delivery of massively scalable technology with low risk implementation is really key. At Talend we see a wide array of batch processes, moving to an operational and real time perspective, driven by the consumers of the data. In this vein, the uptake in adoption and the growing community of Apache Spark, the powerful open-source processing engine, has been hard to miss. In a relatively short time, it is now a part of every major Hadoop vendor’s offering, is the most active open sou... |
| ["Nick Pentreath (Graphflow)","Kan Zhang (IBM)"] | ["Apache Spark","Engineering Blog"] | {"createdOn":"2014-09-18","publishedOn":"2014-09-18","tz":"UTC"} | <div class="post-meta">This is a guest post by Nick Pentreath of <a href="http://graphflow.com">Graphflow</a> and Kan Zhang of <a href="http://ibm.com">IBM</a>, who contributed Python input/output format support to Apache Spark 1.1.</div> <hr /> Two powerful features of Apache Spark include its native APIs provided in Scala, Java and Python, and its compatibility with any Hadoop-based input or output source. This language support means that users can quickly become proficient in the use of Spark even without experience in Scala, and furthermore can leverage the extensive set of third-party libraries available (for example, the many data analysis libraries for Python). Built-in Hadoop support means that Spark can work "out of the box" with any data storage system or format that implements Hadoop's <code>InputFormat</code> and <code>OutputFormat</code> interfaces, including HDFS, HBase, Cassandra, Elasticsearch, DynamoDB and many others, as well as various data serialization formats s... |
| ["Vida Ha"] | ["Company Blog","Product"] | {"createdOn":"2014-09-24","publishedOn":"2014-09-24","tz":"UTC"} | At Databricks, we are often asked how to go beyond the basic Apache Spark tutorials and start building real applications with Spark. As a result, we are developing reference applications <a href="http://github.com/databricks/reference-apps" target="_blank">on github</a> to demonstrate that. We believe this is a great way to learn Spark, and we plan on incorporating more features of Spark into the applications over time. We also hope to highlight any technologies that are compatible with Spark and include best practices. <h3>Log Analyzer Application</h3> Our first reference application is log analysis with Spark. Logs are a large and common data set that contain a rich set of information. Log data can be used for monitoring web servers, improving business and customer intelligence, building recommendation systems, preventing fraud, and much more. Spark is a wonderful tool to use on logs - Spark can process logs faster than Hadoop MapReduce, it is easy to code so we can compute many... |
| ["John Tripier","Paco Nathan"] | ["Announcements","Company Blog"] | {"createdOn":"2014-09-19","publishedOn":"2014-09-19","tz":"UTC"} | When Databricks was initially founded a little more than a year ago, there was tremendous excitement around Apache Spark, but it was still early days. The project had ~60 contributors over the previous 12 months, and was not yet available commercially. One of our main focus areas since then has been continuing to grow Spark and the community and making it easily accessible for enterprises and users alike. Taking a step back, it’s terrific to see the progress that Spark has made since then. Spark is today the most active open source project in the Big Data ecosystem with over 300 contributors in the last 12 months alone, and is available through several platform vendors, including all of the major Hadoop distributors. The <a href="http://www.spark-summit.org" target="_blank">Spark Summit</a>, dedicated to bringing together the Spark community, more than doubled in size a short 6 months after the inaugural version, and Spark meetups continue to grow in size, frequency, and cities sp... |
| ["Christopher Burdorf (Senior Software Engineer at NBC Universal)"] | ["Company Blog","Customers"] | {"createdOn":"2014-09-24","publishedOn":"2014-09-24","tz":"UTC"} | <div class="post-meta">This is a guest blog post from our friends at NBC Universal outlining their Apache Spark use case.</div> <hr /> <h2>Business Challenge</h2> NBC Universal is one of the world’s largest media and entertainment companies with revenues of US$ 26 billion. It operates television networks, cable channels, motion picture and television production companies as well as branded theme parks worldwide. Popular brands include NBC, Universal Pictures, Universal Parks & Resorts, Telemundo, E!, Bravo and MSNBC. Digital video media clips for NBC Universal’s cable TV programs and commercials are produced and broadcast from its Los Angeles office to cable TV channels in Asia Pacific, Europe, Latin America and the United States. Moreover, viewers increasingly consume NBC Universal’s vast content library online and on-demand. Therefore, NBC Universal’s IT Infrastructure team needs to make decisions on how best to serve that content, which involves a trade-off between storage a... |
| ["Manish Amde (Origami Logic)","Joseph Bradley (Databricks)"] | ["Engineering Blog","Machine Learning"] | {"createdOn":"2014-09-30","publishedOn":"2014-09-30","tz":"UTC"} | <div class="post-meta">This is a post written together with one of our friends at <a href="http://www.origamilogic.com/">Origami Logic</a>. Origami Logic provides a Marketing Intelligence Platform that uses Apache Spark for heavy lifting analytics work on the backend.</div> <hr /> Decision trees and their ensembles are industry workhorses for the machine learning tasks of classification and regression. Decision trees are easy to interpret, handle categorical and continuous features, extend to multi-class classification, do not require feature scaling and are able to capture non-linearities and feature interactions. Due to their popularity, almost every machine learning library provides an implementation of the decision tree algorithm. However, most are designed for single-machine computation and seldom scale elegantly to a distributed setting. Apache Spark is an ideal platform for a scalable distributed decision tree implementation since Spark's in-memory computing allows us to effi... |
| ["Eric Carr (VP Core Systems Group at Guavus)"] | ["Company Blog","Partners"] | {"createdOn":"2014-09-25","publishedOn":"2014-09-25","tz":"UTC"} | <div class="post-meta">This is a guest blog post from our friends at <a href="http://www.guavus.com" target="_blank">Guavus</a> - now a Certified Apache Spark Distribution - outlining how they leverage Spark to deliver value to telecom companies.</div> <hr /> <h2>Business Challenge</h2> Guavus is a leading provider of big data analytics solutions for the Communications Service Provider (CSP) industry. The company counts 4 of the top 5 mobile network operators, 3 of the top 5 Internet backbone providers, as well as 80% of cable MSOs in North America as customers. The Guavus Reflex platform provides operational intelligence to these service providers. Reflex currently analyzes more than 50% of all US mobile data traffic and processes more than 2.5 petabytes of data per day. Yet that data grows at an exponential rate. Ever increasing data volume and velocity makes it harder to generate timely insights. For instance, one operational issue can quickly cascade into multiple issues down-st... |
| ["Jeremy Freeman (Freeman Lab)"] | ["Apache Spark","Engineering Blog","Streaming"] | {"createdOn":"2014-10-01","publishedOn":"2014-10-01","tz":"UTC"} | The brain is the most complicated organ of the body, and probably one of the most complicated structures in the universe. It’s millions of neurons somehow work together to endow organisms with the extraordinary ability to interact with the world around them. Things our brains control effortlessly -- kicking a ball, or reading and understanding this sentence -- have proven extremely hard to implement in a machine. For a long time, our efforts were limited by experimental technology. Despite the brain having many neurons, most technologies could only monitor the activity of one, or a handful, at once. That these approaches taught us so much -- for example, that there are neurons that respond only when you look at a particular object -- is a testament to experimental ingenuity. In the next era, however, we will be limited not by our recordings, but our ability to make sense of the data. New technologies make it possible to monitor the activity of many thousands of neurons at once -- fro... |
| ["Russell Cardullo (Sharethrough)"] | ["Company Blog","Customers"] | {"createdOn":"2014-10-07","publishedOn":"2014-10-07","tz":"UTC"} | <div class="post-meta">This is a guest blog post from our friends at <a href="http://www.sharethrough.com" target="_blank">Sharethrough</a> providing an update on how their use of Apache Spark has continued to expand.</div> <hr /> <h2>Business Challenge</h2> Sharethrough is an advertising technology company that provides native, in-feed advertising software to publishers and advertisers. Native, in-feed ads are designed to match the form and function of the sites they live on, which is particularly important on mobile devices where interruptive advertising is less effective. For publishers, in-feed monetization has become a major revenue stream for their mobile sites and applications. For advertisers, in-feed ads have been proven to drive more brand lift than interruptive banner advertisements. Sharethrough’s publisher and advertiser technology suite is capable of optimizing the format of an advertisement for seamless placement on content publishers websites and apps. This involves ... |
| ["Sean Kandel (CTO at Trifacta)"] | ["Company Blog","Partners"] | {"createdOn":"2014-10-09","publishedOn":"2014-10-09","tz":"UTC"} | <div class="post-meta">This post is guest authored by our friends at <a href="http://www.trifacta.com" target="_blank">Trifacta</a> after having their data transformation platform <a href="http://www.databricks.com/certification" target="_blank">“Certified on Spark.”</a></div> <hr> Today we announced v2 of the Trifacta Data Transformation Platform, a release that emphasizes the important role that Hadoop plays in the new big data enterprise architecture. With Trifacta v2 we now support transforming data of all shapes and sizes in Hadoop. This means supporting Hadoop-specific data formats as both inputs and outputs in Trifacta v2 - data formats such as Avro, ORC and Parquet. It also means intelligently executing data transformation scripts through not only MapReduce, which was available in Trifacta v1, but also Spark. Trifacta v2 has been officially Certified on Spark by Databricks. Our partnership with Databricks brings the performance and flexibility of the Spark data processing en... |
| ["Reynold Xin"] | ["Apache Spark","Engineering Blog"] | {"createdOn":"2014-10-10","publishedOn":"2014-10-10","tz":"UTC"} | <strong>Update November 5, 2014</strong>: Our benchmark entry has been reviewed by the benchmark committee and Apache Spark has won the <a href="http://sortbenchmark.org/">Daytona GraySort contest</a> for 2014! Please see this <a href="https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html">new blog post for update</a>. Apache Spark has seen phenomenal adoption, being widely slated as the successor to Hadoop MapReduce, and being deployed in clusters from a handful to thousands of nodes. While it was clear to everybody that Spark is more efficient than MapReduce for data that fits in memory, we heard that some organizations were having trouble pushing it to large scale datasets that could not fit in memory. Therefore, since the inception of Databricks, we have devoted much effort, together with the Spark community, to improve the stability, scalability, and performance of Spark. Spark works well for gigabytes or terabytes of data, and it s... |
| ["Reza Zadeh"] | ["Apache Spark","Engineering Blog","Machine Learning"] | {"createdOn":"2014-10-20","publishedOn":"2014-10-20","tz":"UTC"} | <div class="post-meta">Our friends at Twitter have contributed to MLlib, and this post uses material from Twitter’s description of its <a href="https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum" target="_blank">open-source contribution</a>, with permission. The associated <a href="https://github.com/apache/spark/pull/1778" target="_blank">pull request</a> is slated for release in Apache Spark 1.2.</div> <hr /> <h2>Introduction</h2> We are often interested in finding users, hashtags and ads that are very similar to one another, so they may be recommended and shown to users and advertisers. To do this, we must consider many pairs of items, and evaluate how “similar” they are to one another. We call this the “all-pairs similarity” problem, sometimes known as a “similarity join.” We have developed a new efficient algorithm to solve the similarity join called “Dimension Independent Matrix Square using MapReduce,” or <a href="http://arxiv.org/abs/1304.1467" target="_blank">DIM... |
| ["Jeff Feng (Product Manager at Tableau Software)"] | ["Company Blog","Partners"] | {"createdOn":"2014-10-15","publishedOn":"2014-10-15","tz":"UTC"} | <div class="post-meta">This post is guest authored by our friends at <a href="http://www.tableausoftware.com" target="_blank">Tableau Software</a>, whose visual analytics software is now <a href="http://www.databricks.com/certification" target="_blank">“Certified on Apache Spark.”</a></div> <hr /> <img class="aligncenter size-full wp-image-62" style="max-width: 100%; display: block; margin: 30px auto 5px auto;" src="https://databricks.com/wp-content/uploads/2014/10/Tableau-SparkSQL.png" alt="" align="middle" /> <h2>Apache Spark - The Next Big Innovation</h2> Once every few years or so, the big data open source community experiences a major innovation that advances the capabilities of data processing frameworks. For many years, MapReduce and the Hadoop open-source platform served as an effective foundation for the distributed processing of large data sets. Then last year, the introduction of YARN provided the resource manager needed to enable interactive workloads, bringing data proce... |
| ["Scott Walent"] | ["Announcements","Company Blog","Events"] | {"createdOn":"2014-10-23","publishedOn":"2014-10-23","tz":"UTC"} | The call for presentations for the inaugural <a href="http://spark-summit.org/east">Spark Summit East</a> is now open. Please join us in New York City on March 18-19, 2015 to share your experience with Apache Spark and celebrate its growing community. Spark Summit East is looking for presenters who would like to showcase how Spark and its related technologies are used in applications, development, data science and research. Please visit our <a href="http://www.spark-summit.org/east/2015/CFP">submission page</a> for additional details. The Deadline for submissions is December 5, 2014 at 11:59pm PST. Spark Summit East is the leading event for <a href="http://spark.apache.org">Apache Spark </a>users, developers and vendors. It is an exciting opportunity to meet analysts, researchers, developers and executives interested in utilizing Spark technology to answer big data questions. If you missed <a href="http://spark-summit.org/2014">Spark Summit 2014</a>, all the content is available onl... |
| ["Ari Himmel (CEO at Faimdata)","Nan Zhu (Chief Architect at Faimdata)"] | ["Company Blog","Partners"] | {"createdOn":"2014-10-27","publishedOn":"2014-10-27","tz":"UTC"} | <div class="post-meta">This post is guest authored by our friends at <a href="http://www.faimdata.com" target="_blank">Faimdata</a>, whose Consumer Data Intelligence Service is now <a href="http://www.databricks.com/certification" target="_blank">“Certified on Apache Spark.”</a></div> <hr /> <h2>Forecasting, Analytics, Intelligence, Machine Learning</h2> Faimdata’s Consumer Data Intelligence Service is a turnkey Big Data solution that provides comprehensive infrastructure and applications to retailers. We help our clients form close connections with their customers and make timely business decisions, using their existing data sources. The unified data processing pipeline deployed by Faimdata has three core focuses. They are (i) our Personalization Service that identifies the personal preferences and buying behaviors of each individual consumer using recommendation/machine learning algorithms; (ii) our Data Analytic Workbench where clients execute high performance multi-dimensional an... |
| ["John Kreisa (VP of Strategic Marketing at Hortonworks)"] | ["Company Blog","Partners"] | {"createdOn":"2014-10-31","publishedOn":"2014-10-31","tz":"UTC"} | <div class="post-meta">This post is guest authored by our friends at <a href="http://www.hortonworks.com" target="_blank">Hortonworks</a> announcing a broader partnership with Databricks around Apache Spark.</div> <hr> At Hortonworks we are very excited by the emerging use cases and potential of Apache Spark and Apache Hadoop. Spark is representative of just one of the shifts underway in the data landscape towards memory optimized processing, that when combined with Hadoop, can enable a new generation of applications. We are excited to announce that Hortonworks and Databricks have extended our partnership focus from providing a <a href="https://databricks.com/spark/certification/certified-spark-distribution" target="_blank">Certified Spark Distribution</a> to include a shared vision to further Apache Spark as an enterprise ready component of the Hortonworks Data Platform. We are closely aligned on a strategy and vision of bringing 100% open source software to market for the enterp... |
| ["Sachin Chawla (VP of Engineering)"] | ["Company Blog","Partners"] | {"createdOn":"2014-11-25","publishedOn":"2014-11-25","tz":"UTC"} | <div class="post-meta">This post is guest authored by our friends at <a href="http://www.skytree.net" target="_blank">Skytree</a>, whose Skytree Infinity platform is now <a href="http://www.databricks.com/certification" target="_blank">“Certified on Apache Spark.”</a></div> <hr /> <h2>To Infinity and Beyond - Big Data at the speed of light</h2> Astronomers were into Big Data before it was big. In order to learn about the history of the universe, they needed to observe and record billions and billions of astronomical objects and perform heavy-duty analysis on the resulting massive datasets. Available predictive methods were not scalable to the size of data sets they were dealing with so they turned to Skytree to obtain unprecedented performance and accuracy on the largest datasets ever collected. Fast-forward a decade or so and the need to store, access, process and analyze datasets of astronomical sizes is now mainstream in the guise of Big Data analytics. <a href="http://www.skytre... |
| ["Sonal Goyal (CEO)"] | ["Company Blog","Partners"] | {"createdOn":"2014-12-02","publishedOn":"2014-12-02","tz":"UTC"} | <div class="post-meta">This post is guest authored by our friends at <a href="http://nubetech.co/" target="_blank">Nube Technologies</a>, whose Reifier platform is now <a href="http://www.databricks.com/certification" target="_blank">“Certified on Apache Spark.”</a></div> <hr /> <h2>About Nube Technologies</h2> Nube Technologies builds business applications to better decision making through better data. Nube’s fuzzy matching product Reifier helps companies get a holistic view of enterprise data. By linking and resolving entities across various sources, Reifier helps optimize the sales and marketing funnel, promotes enhanced security and risk management and better consolidation and reporting of business data. We help our customers build better and effective models by ensuring that their underlying master data is accurate. <h2>Why Apache Spark</h2> Data matching within a single source or across sources is a very core problem faced by almost every enterprise and we wanted to create a re... |
| [" Dibyendu Bhattacharya (Big Data Architect)"] | ["Company Blog","Partners"] | {"createdOn":"2014-12-09","publishedOn":"2014-12-09","tz":"UTC"} | <div class="post-meta">This is a guest blog post from our friends at Pearson outlining their Apache Spark use case.</div> <hr /> <h2>Introduction of Pearson</h2> Pearson is a British multinational publishing and education company headquartered in London. It is the largest education company and the largest book publisher in the world. Recently, Pearson announced a new organization structure in order to accelerate their push into digital learning, education services and emerging markets. I am part of Pearson Higher Education group, which provides textbooks and digital technologies to teachers and students across Higher Education. Pearson's higher education brands include eCollege, Mastering/MyLabs and Financial Times Publishing. <h2>What we wanted to do</h2> We are building a next generation adaptive learning platform which delivers immersive learning experiences designed for the way today’s students read, think, and learn. This learning platform is a scalable, reliable, cloud-based pl... |
| ["Reynold Xin"] | ["Apache Spark","Engineering Blog"] | {"createdOn":"2014-11-05","publishedOn":"2014-11-05","tz":"UTC"} | A month ago, we shared with you our entry to the 2014 Gray Sort competition, a 3rd-party benchmark measuring how fast a system can sort 100 TB of data (1 trillion records). Today, we are happy to announce that our entry has been reviewed by the benchmark committee and we have officially won the <a href="http://sortbenchmark.org/">Daytona GraySort contest</a>! In case you missed our <a href="https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html">earlier blog post</a>, using Spark on 206 EC2 machines, we sorted 100 TB of data on disk in 23 minutes. In comparison, the previous world record set by Hadoop MapReduce used 2100 machines and took 72 minutes. This means that Apache Spark sorted the same data <strong>3X faster</strong> using <strong>10X fewer machines</strong>. All the sorting took place on disk (HDFS), without using Spark’s in-memory cache. This entry tied with a UCSD research team building high performance systems and we jointly set a new world record. <table class="... |
| ["Matt MacKinnon (Director of Product Management at Zaloni)"] | ["Company Blog","Partners"] | {"createdOn":"2014-11-14","publishedOn":"2014-11-14","tz":"UTC"} | <div class="post-meta">This post is guest authored by our friends at <a href="http://www.zaloni.com" target="_blank">Zaloni</a>, whose Bedrock platform is now <a href="http://www.databricks.com/certification" target="_blank">“Certified on Apache Spark.”</a></div> <hr /> <h2>Bedrock’s Managed Data Pipeline now includes Apache Spark</h2> It was evident from the all the buzz at the Strata + Hadoop World conference that Apache Spark has now shifted from the early adopter phase to establishing itself as an integral and permanent part of the Hadoop ecosystem. The rapid pace of adoption is impressive! Given the entrance of Spark into the mainstream Hadoop world, we are glad to announce that Bedrock is now officially Certified on Spark. <h2>How does Spark enhance Bedrock?</h2> Bedrock™ defines a Managed Data Pipeline as consisting of Ingest, Organize, and Prepare stages. Bedrock’s strength lies in the integrated nature of the way data is handled through these stages. ● Ingest: Bring data fr... |
| ["John Tripier","Paco Nathan"] | ["Announcements","Company Blog"] | {"createdOn":"2014-11-15","publishedOn":"2014-11-15","tz":"UTC"} | More and more companies are using Apache Spark, and many Spark based pilots are currently deploying in production. In social media, at every big data conference or meetup, people describe new POC, prototypes, and production deployments using Spark. Behind this momentum, a growing need for Spark developers is developing; people who have demonstrated expertise in how to implement best practices for Spark. People who can help the enterprise building increasingly complex and sophisticated solutions on top of their Spark deployments. At Databricks, we get contacted by many enterprises looking for Spark resources to help with their next data-driven initiative. And so beyond our effort to train people on Spark directly or through partners all around the world, we have teamed up with O’Reilly for offering the first industry standard for measuring and validating a developer’s expertise on Spark. <h2>Benefits of being a Spark Certified Developer</h2> The Spark Developer Certification is the wa... |
| ["Luis Quintela (Sr. Manager of Big Data Analytics)","Yan Breek (Data Scientist)","Girish Kathalagiri (Data Analytics Engineer)"] | ["Company Blog","Partners"] | {"createdOn":"2014-11-22","publishedOn":"2014-11-22","tz":"UTC"} | <div class="post-meta">This is a guest blog post from our friends at Samsung SDS outlining their Apache Spark use case.</div> <hr /> <h2>Business Challenge</h2> Samsung SDS is the business and IT solutions arm of Samsung Group. A global ICT service provider with over 17,000 employees worldwide and 6.7 billion USD in revenues, Samsung SDS tackles the challenges of some of the largest global enterprises in such industries as manufacturing, financial services, health care and retail. In the different areas Samsung is focused on, the ability to make timely decisions that maximize the value to a business becomes critical. Prescriptive analytics methods have been used effectively to support decision making by leveraging probable future outcomes determined by predictive models and suggesting actions that provide maximal business value. One of the main challenges in applying prescriptive analytics in these areas is the need to analyze a combination of structured and unstructured data at la... |
| ["Ameet Talwalkar","Anthony Joseph"] | ["Announcements","Company Blog"] | {"createdOn":"2014-12-02","publishedOn":"2014-12-02","tz":"UTC"} | In the age of ‘Big Data,’ with datasets rapidly growing in size and complexity and cloud computing becoming more pervasive, data science techniques are fast becoming core components of large-scale data processing pipelines. Apache Spark offers analysts and engineers a powerful tool for building these pipelines, and learning to build such pipelines will soon be a lot easier. Databricks is excited to be working with professors from University of California Berkeley and University of California Los Angeles to produce two new upcoming Massive Open Online Courses (MOOCs). Both courses will be freely available on the edX MOOC platform in <del>spring</del> summer 2015. edX Verified Certificates are also available for a fee. <img class="aligncenter size-full wp-image-62" style="max-width: 100%; display: block; margin: 30px auto 5px auto;" src="https://databricks.com/wp-content/uploads/2014/12/MOOC1.png" alt="" align="middle" /> The first course, called <a href="https://www.edx.org/course/uc... |
| ["Lieven Gesquiere (Virdata Lead Core R&D)"] | ["Company Blog","Partners"] | {"createdOn":"2014-12-04","publishedOn":"2014-12-04","tz":"UTC"} | <div class="post-meta">This post is guest authored by our friends at <a href="http://www.technicolor.com/" target="_blank">Technicolor</a>, whose Virdata platform is now <a href="http://www.databricks.com/certification" target="_blank">“Certified on Apache Spark.”</a></div> <hr /> <h2>About Virdata</h2> Virdata is Technicolor’s cloud-native Internet of Things platform offering real-time monitoring, configuration and management of the unprecedented number of connected devices and applications. Combining its highly-scalable data ingestion and messaging capabilities with real-time and historical analytics, Virdata brings value across multiple data-driven markets. The Virdata platform was launched at CES Las Vegas in January, 2014. The Virdata cloud-based platform architecture integrates state-of-the-art open source software components into a homogeneous, high-availability data-processing environment. <h2>Virdata and Apache Spark</h2> The Virdata solution architecture comprises 3 areas:... |
| ["by Databricks Press Office"] | ["Announcements","Company Blog"] | {"createdOn":"2015-01-13","publishedOn":"2015-01-13","tz":"UTC"} | <strong>Highlights:</strong> <ul> <li>Databricks Expands Bay Area Presence, Moves HQ to San Francisco</li> <li>Company Names Kavitha Mariappan as Marketing Vice President</li> </ul> Press Release: <a title="http://finance.yahoo.com/news/databricks-expands-bay-area-presence-140000610.html" href="http://finance.yahoo.com/news/databricks-expands-bay-area-presence-140000610.html">http://finance.yahoo.com/news/databricks-expands-bay-area-presence-140000610.html</a> <strong>San Francisco, Calif. – January 13, 2015 – </strong><a href="http://www.databricks.com">Databricks</a>, the company founded by the creators of the popular open-source Big Data processing engine Apache Spark with its flagship product, Databricks Cloud, today announced the relocation of their headquarters to San Francisco from Berkeley, California. The expansion is a reflection of Databricks’ growth heading into 2015. The company grew more than 200 percent in headcount over the last year and adds talent to its executive ... |
| ["Kavitha Mariappan"] | ["Announcements","Company Blog"] | {"createdOn":"2015-01-16","publishedOn":"2015-01-16","tz":"UTC"} | Complementing our on-going direct and partner-led Apache Spark training efforts, Databricks has teamed up with O’Reilly to offer the industry’s first standard for measuring and validating a developer’s expertise with Spark. Databricks and O’Reilly are proud to announce the online availability of the Spark Certified Developer exams. You can now sign up and take the exam online<a href=" http://go.databricks.com/spark-certified-developer"> here</a>. <b>What is the Spark Certified Developer program?</b> Apache Spark is the most active project in the Big Data ecosystem and is fast becoming the open source alternative of choice for many enterprises. Spark provides enterprises with the scale and sophistication they require to gain insights from their Big Data by providing a unified framework for building data pipelines. Databricks was founded by the team that created and continues to lead both development and training around Spark, and<a href="https://databricks.com/product"> Databricks Cl... |
| ["Kavitha Mariappan"] | ["Company Blog","Events"] | {"createdOn":"2015-01-20","publishedOn":"2015-01-20","tz":"UTC"} | We are thrilled to announce the availability of the <a href="http://go.spark-summit.org/e1t/c/*W6stDzJ6_3DYhW6Y-qp35L8r5j0/*W4PZ7v36VwsQzW58WPXZ57MJJH0/5/f18dQhb0Sq5z8YHrDTW8HLj0x5VQHw7W6bFhBV6P7FhxW4R4BZM57mvC2W1BQYgg4P0TLvW85Q81T83G7d1W9dtj1h7NQNCqW4zWTRG33K-8nW7NMj-x9bTNXYW954KlM4P0Yt6W2d4hSK3bWrh8W2YH1kR47xfHKW2HRyfR6trFPNW47YlYy4bfcHbW47Xx4z3C811XW4-SZvb2KQ2YYW3_VZwP5ThdHgW3s1XjF51G0BJW4Zh8Y-57-WqMW3H_Pty2DzCtRW1zBkSq1sQ3b4W8V-D1g5rcXhJW7JS0c27BQjYmVJB4Mm896Q7XW94B_1g7v78c8W8NqNPC5qWyC0W7JTtyJ2Xm03sW3FBZ5D9lNHw9W6_b40v3vyNkPW6J4Ypk8lBfs0W3bnqM_1C-9rFVL--5_1Pct9JW2mPjk95hqX5PW9lKhck4H6s3gN4m21WR6Q977Vb98_P6s16_2W8Ph58-59BvQ0W7y34GD1FmQY-W7r71Hq2PhWHMW7tprCG95RqNQW2j-Sgt2L5GhqW3G6xft6TMH99W6-cC_w3wXTtZW6Sytzy9fTwQmN3FYx-Q_HpmRf6dY7D511" target="_blank">agenda</a> for Spark Summit East 2015! This inaugural New York City event on <span class="aBn" tabindex="0" data-term="goog_929332804"><span class="aQJ">March 18-19, 2015</span></span> has over thirty jam-packed sessions – offering a ... |
| ["Yin Huai (Databricks)"] | ["Apache Spark","Engineering Blog"] | {"createdOn":"2015-02-02","publishedOn":"2015-02-02","tz":"UTC"} | [sidenote]Note: Starting Spark 1.3, SchemaRDD will be renamed to DataFrame.[/sidenote] <hr /> In this blog post, we introduce Spark SQL’s JSON support, a feature we have been working on at Databricks to make it dramatically easier to query and create JSON data in Spark. With the prevalence of web and mobile applications, JSON has become the de-facto interchange format for web service API’s as well as long-term storage. With existing tools, users often engineer complex pipelines to read and write JSON data sets within analytical systems. Spark SQL’s JSON support, released in Apache Spark 1.1 and enhanced in Apache Spark 1.2, vastly simplifies the end-to-end-experience of working with JSON data.<!--more--> <h2>Existing practices</h2> In practice, users often face difficulty in manipulating JSON data with modern analytical systems. To write a dataset to JSON format, users first need to write logic to convert their data to JSON. To read and query JSON datasets, a common practice is to us... |
| ["Jeremy Freeman (Howard Hughes Medical Institute)"] | ["Apache Spark","Engineering Blog","Streaming"] | {"createdOn":"2015-01-28","publishedOn":"2015-01-28","tz":"UTC"} | Many real world data are acquired sequentially over time, whether messages from social media users, time series from wearable sensors, or — in a case we are particularly excited about — the firing of large populations of neurons. In these settings, rather than wait for all the data to be acquired before performing our analyses, we can use streaming algorithms to identify patterns over time, and make more targeted predictions and decisions. One simple strategy is to build machine learning models on static data, and then use the learned model to make predictions on an incoming data stream. But what if the patterns in the data are themselves dynamic? That's where streaming algorithms come in. A key advantage of Apache Spark is that its machine learning library (MLlib) and its library for stream processing (Spark Streaming) are built on the same core architecture for distributed analytics. This facilitates adding extensions that leverage and combine components in novel ways without reinv... |
| ["Dave Wang (Databricks)"] | ["Announcements","Company Blog"] | {"createdOn":"2015-02-05","publishedOn":"2015-02-05","tz":"UTC"} | Recently <a href="http://www.infoworld.com/article/2871935/application-development/infoworlds-2015-technology-of-the-year-award-winners.html" target="_blank">Infoworld unveiled the 2015 Technology of the Year Award winners</a>, which range from open source software to stellar consumer technologies like the iPhone. Being the <a title="Announcing Spark 1.2" href="https://databricks.com/blog/2014/12/19/announcing-spark-1-2.html" target="_blank">creators behind Apache Spark</a>, Databricks is thrilled to see Spark in their ranks. In fact, we built our flagship product, <a title="Databricks Cloud Overview" href="https://databricks.com/product">Databricks</a>, on top of Spark with the ambition to revolutionize big data processing in ways similar to how iPhone revolutionized the mobile experience. The iPhone was revolutionary in a number of ways: first, it integrated a disparate set of consumer electronic capabilities such as mobile phone, camera, GPS, and even laptop; second, it created a... |
| ["Patrick Wendell"] | ["Apache Spark","Engineering Blog"] | {"createdOn":"2014-12-19","publishedOn":"2014-12-19","tz":"UTC"} | We at Databricks are thrilled to announce the release of Apache Spark 1.2! Apache Spark 1.2 introduces many new features along with scalability, usability and performance improvements. This post will introduce some key features of Apache Spark 1.2 and provide context on the priorities of Spark for this and the next release. In the next two weeks, we’ll be publishing blog posts with more details on feature additions in each of the major components. Apache Spark 1.2 has been posted today on the <a href="http://spark.apache.org/releases/spark-release-1-2-0.html">Apache Spark website</a>. Learn more about specific new features in related in-depth posts: <ul> <li><a title="Spark SQL Data Sources API: Unified Data Access for the Spark Platform" href="https://databricks.com/blog/2015/01/09/spark-sql-data-sources-api-unified-data-access-for-the-spark-platform.html" target="_blank">Spark SQL data sources API</a></li> <li><a title="An introduction to JSON support in Spark SQL" href="https:/... |
| ["Xiangrui Meng","Patrick Wendell"] | ["Apache Spark","Ecosystem","Engineering Blog"] | {"createdOn":"2014-12-22","publishedOn":"2014-12-22","tz":"UTC"} | Today, we are happy to announce <em>Apache Spark Packages</em> (<a title="http://spark-packages.org" href="http://spark-packages.org">http://spark-packages.org</a>), a community package index to track the growing number of open source packages and libraries that work with Apache Spark. <em>Spark Packages</em> makes it easy for users to find, discuss, rate, and install packages for any version of Spark, and makes it easy for developers to contribute packages. <!--more--> <em>Spark Packages</em> will feature integrations with various data sources, management tools, higher level domain-specific libraries, machine learning algorithms, code samples, and other Spark content. Thanks to the package authors, the initial listing of packages includes <a href="http://spark-packages.org/package/6">scientific computing libraries</a>, a <a href="http://spark-packages.org/package/10">job execution server</a>, a connector for <a href="http://spark-packages.org/package/3">importing Avro data</a>, tool... |
| ["Xiangrui Meng","Joseph Bradley","Evan Sparks (UC Berkeley)","Shivaram Venkataraman (UC Berkeley)"] | ["Engineering Blog","Machine Learning"] | {"createdOn":"2015-01-07","publishedOn":"2015-01-07","tz":"UTC"} | MLlib’s goal is to make practical machine learning (ML) scalable and easy. Besides new algorithms and performance improvements that we have seen in each release, a great deal of time and effort has been spent on making MLlib <i>easy</i>. Similar to Spark Core, MLlib provides APIs in three languages: Python, Java, and Scala, along with user guide and example code, to ease the learning curve for users coming from different backgrounds. In Apache Spark 1.2, Databricks, jointly with AMPLab, UC Berkeley, continues this effort by introducing a pipeline API to MLlib for easy creation and tuning of practical ML pipelines. A practical ML pipeline often involves a sequence of data pre-processing, feature extraction, model fitting, and validation stages. For example, classifying text documents might involve text segmentation and cleaning, extracting features, and training a classification model with cross-validation. Though there are many libraries we can use for each stage, connecting the dots ... |
| ["Michael Armbrust"] | ["Apache Spark","Engineering Blog"] | {"createdOn":"2015-01-09","publishedOn":"2015-01-09","tz":"UTC"} | Since the inception of Spark SQL in Apache Spark 1.0, one of its most popular uses has been as a conduit for pulling data into the Spark platform. Early users loved Spark SQL’s support for reading data from existing Apache Hive tables as well as from the popular Parquet columnar format. We’ve since added support for other formats, such as <a href="https://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets">JSON</a>. In Apache Spark 1.2, we've taken the next step to allow Spark to integrate natively with a far larger number of input sources. These new integrations are made possible through the inclusion of the new Spark SQL Data Sources API. <a href="https://databricks.com/wp-content/uploads/2015/01/DataSourcesApiDiagram.png"><img class="wp-image-2372 aligncenter" src="https://databricks.com/wp-content/uploads/2015/01/DataSourcesApiDiagram-1024x526.png" alt="DataSourcesApiDiagram" width="516" height="265" /></a> The Data Sources API provides a pluggable mechanism... |
| ["Joseph K. Bradley (Databricks)","Manish Amde (Origami Logic)"] | ["Apache Spark","Engineering Blog","Machine Learning"] | {"createdOn":"2015-01-21","publishedOn":"2015-01-21","tz":"UTC"} | <div class="post-meta">This is a post written together with Manish Amde from <a href="http://www.origamilogic.com/">Origami Logic</a>.</div> <hr /> Apache Spark 1.2 introduces <a href="http://en.wikipedia.org/wiki/Random_forest">Random Forests</a> and <a href="http://en.wikipedia.org/wiki/Gradient_boosting#Gradient_tree_boosting">Gradient-Boosted Trees (GBTs)</a> into MLlib. Suitable for both classification and regression, they are among the most successful and widely deployed machine learning methods. Random Forests and GBTs are <i>ensemble learning algorithms</i>, which combine multiple decision trees to produce even more powerful models. In this post, we describe these models and the distributed implementation in MLlib. We also present simple examples and provide pointers on how to get started. <h2>Ensemble Methods</h2> Simply put, <a href="http://en.wikipedia.org/wiki/Ensemble_learning">ensemble learning algorithms</a> build upon other machine learning methods by combining models... |
| ["Tathagata Das"] | ["Apache Spark","Engineering Blog","Streaming"] | {"createdOn":"2015-01-15","publishedOn":"2015-01-15","tz":"UTC"} | Real-time stream processing systems must be operational 24/7, which requires them to recover from all kinds of failures in the system. Since its beginning, Apache Spark Streaming has included support for recovering from failures of both driver and worker machines. However, for some data sources, input data could get lost while recovering from the failures. In Apache Spark 1.2, we have added preliminary support for write ahead logs (also known as journaling) to Spark Streaming to improve this recovery mechanism and give stronger guarantees of zero data loss for more data sources. In this blog, we are going to elaborate on how this feature works and how developers can enable it to get those guarantees in Spark Streaming applications. <h2>Background</h2> Spark and its RDD abstraction is designed to seamlessly handle failures of any worker nodes in the cluster. Since Spark Streaming is built on Spark, it enjoys the same fault-tolerance for worker nodes. However, the demand of high uptimes ... |
| ["Kavitha Mariappan"] | ["Announcements","Company Blog"] | {"createdOn":"2015-01-27","publishedOn":"2015-01-27","tz":"UTC"} | In partnership with <a href="https://typesafe.com/">Typesafe</a>, we are excited to see the publication of the <a href="http://info.typesafe.com/COLL-20XX-Spark-Survey-Report_LP.html?lst=PR&lsd=COLL-20XX-Spark-Survey-Trends-Adoption-Report">survey report</a> representing the largest poll of Apache Spark developers to date. Spark is currently the most active open source project in big data and has been rapidly gaining traction over the past few years. This survey of over 2100 respondents further validates the wide variety of use cases and environments where it is being deployed. The survey results indicate that 13% are already using Spark in production environments with 20% of the respondents with plans to deploy Spark in production environments in 2015, and 31% are currently in the process of evaluating it. In total, the survey covers over 500 enterprises that are using or planning to use Spark in production environments ranging from on-premise Hadoop clusters to public clouds, wi... |
| ["Holden Karau","Andy Konwinski","Patrick Wendell","Matei Zaharia"] | ["Announcements","Company Blog"] | {"createdOn":"2015-02-09","publishedOn":"2015-02-09","tz":"UTC"} | <a href="https://databricks.com/wp-content/uploads/2015/02/large-oreilly-book-cover.jpg"><img class="size-medium wp-image-2486 aligncenter" src="https://databricks.com/wp-content/uploads/2015/02/large-oreilly-book-cover-228x300.jpg" alt="large oreilly book cover" width="228" height="300" /></a> Today we are happy to announce that the complete <a href="http://shop.oreilly.com/product/0636920028512.do" target="_blank"><i>Learning Spark</i></a> book is available from O’Reilly in e-book form with the print copy expected to be available February 16th. At Databricks, as the creators behind Apache Spark, we have witnessed <a title="Big data projects are hungry for simpler and more powerful tools: Survey validates Apache Spark is gaining developer traction!" href="https://databricks.com/blog/2015/01/27/big-data-projects-are-hungry-for-simpler-and-more-powerful-tools-survey-validates-apache-spark-is-gaining-developer-traction.html" target="_blank">explosive growth in the interest and adoption ... |
| null | ["Announcements","Company Blog","Customers"] | {"createdOn":"2015-02-13","publishedOn":"2015-02-13","tz":"UTC"} | We're really excited to share that <a href="http://www.automatic.com">Automatic Labs </a>has selected Databricks as its preferred big data processing platform. Press release: <a href="http://www.marketwired.com/press-release/automatic-labs-turns-databricks-cloud-faster-innovation-dramatic-cost-savings-1991316.htm" target="_blank">http://www.marketwired.com/press-release/automatic-labs-turns-databricks-cloud-faster-innovation-dramatic-cost-savings-1991316.htm</a> Automatic Labs needed to run large and complex queries against their entire data set to explore and come up with new product ideas. Their prior solution using Postgres impeded the ability of Automatic’s team to efficiently explore data because queries took days to run and data could not be easily visualized, preventing Automatic Labs from bringing critical new products to market. They then deployed Databricks, our simple yet powerful unified big data processing platform on Amazon Web Services (AWS) and realized these key bene... |
| null | ["Apache Spark","Engineering Blog"] | {"createdOn":"2015-02-14","publishedOn":"2015-02-14","tz":"UTC"} | 2014 has been a year of <a href="https://databricks.com/blog/2015/01/27/big-data-projects-are-hungry-for-simpler-and-more-powerful-tools-survey-validates-apache-spark-is-gaining-developer-traction.html">tremendous growth</a> for Apache Spark. It became the most active open source project in the Big Data ecosystem with over 400 contributors, and was adopted by many platform vendors - including all of the major Hadoop distributors. Through our ecosystem of products, partners, and training at Databricks, we also saw over 200 enterprises deploying Spark in production. To help Spark achieve this growth, Databricks has worked broadly throughout the project to improve functionality and ease of use. Indeed, while the community has grown a lot, about 75% of the code added to Spark last year came from Databricks. In this post, we would like to highlight some of the additions we made to Spark in 2014, and provide a preview of our priorities in 2015. In general, our approach to developing Spar... |
| null | ["Company Blog","Partners"] | {"createdOn":"2015-02-19","publishedOn":"2015-02-19","tz":"UTC"} | This is a guest blog from our one of our partners: <a href="http://www.memsql.com/" target="_blank">MemSQL</a> <hr /> <h2>Summary</h2> Coupling operational data with the most advanced analytics puts data-driven business ahead. The MemSQL Apache Spark Connector enables such configurations. <h2>Meeting Transactional and Analytical Needs</h2> Transactional databases form the core of modern business operations. Whether that transaction is financial, physical in terms of inventory changes, or experiential in terms of a customer engagement, the transaction itself moves our business forward. But while transactions represent the state of our business, analytics tell us patterns of the past, and help us predict patterns of the future. Analytics can tell us what levers influence profitability and put us ahead of the pack. Success in digital business requires both transactional and analytical prowess, including the foremost means to analyze data. <h2>Speed and Agility with MemSQL and A... |
| null | ["Apache Spark","Engineering Blog"] | {"createdOn":"2015-02-17","publishedOn":"2015-02-17","tz":"UTC"} | Today, we are excited to announce a new DataFrame API designed to make big data processing even easier for a wider audience. When we first open sourced Apache Spark, we aimed to provide a simple API for distributed data processing in general-purpose programming languages (Java, Python, Scala). Spark enabled distributed data processing through functional transformations on distributed collections of data (RDDs). This was an incredibly powerful API: tasks that used to take thousands of lines of code to express could be reduced to dozens. As Spark continues to grow, we want to enable wider audiences beyond “Big Data” engineers to leverage the power of distributed processing. The new DataFrames API was created with this goal in mind. This API is inspired by data frames in R and Python (Pandas), but designed from the ground-up to support modern big data and data science applications. As an extension to the existing RDD API, DataFrames feature: <ul> <li>Ability to scale from kilobytes o... |
| null | ["Company Blog","Events"] | {"createdOn":"2015-02-24","publishedOn":"2015-02-24","tz":"UTC"} | The Strata + Hadoop World Conference in San Jose last week was abuzz with "putting data to work" in keeping with this year's conference theme. This was a significant shift from last year's event where organizations were highly focused on getting their arms around their big data projects and being steeped in evaluating the multitude of tools of new technologies available. Last week's event highlighted what is top of mind for enterprises and developers alike - how to turn their big data initiatives and projects into real business results? One theme was loud and clear - Apache Spark's flame shone bright! Derrick Harris from GigaOM summed this up aptly in his article "<a href="https://gigaom.com/2015/02/20/for-now-spark-looks-like-the-future-of-big-data/" target="_blank">For now, Spark looks like the future of big data</a>". To quote Derrick, <em>"Titles can be misleading. For example, the O’Reilly Strata + Hadoop World conference took place in San Jose, California, this week but Hadoop ... |
| null | ["Company Blog","Product"] | {"createdOn":"2015-03-04","publishedOn":"2015-03-04","tz":"UTC"} | <div class="article-body"> Enterprises have been collecting ever-larger amounts of data with the goal of extracting insights and creating value. Yet despite a few innovative companies who are able to successfully exploit big data, the promised returns of big data remain elusive beyond the grasp of many enterprises. One notable and rapidly growing open source technology that has emerged in the big data space is Apache Spark. Spark is an open source data processing framework that was built for speed, ease of use, and scale. Much of its benefits are due to how it unifies critical data analytics capabilities such as SQL, machine learning and streaming in a single framework. This enables enterprises to simultaneously achieve high performance computing at scale while simplifying their data processing infrastructure by avoiding the difficult integration of many disparate and difficult tools with a single powerful yet simple alternative. While Spark appears to have the potential to solve m... |
| authors | categories | dates | content |
|---|---|---|---|
Showing the first 156 rows.
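To see concretely why one line of the raw file maps to one row of the table, here is a minimal pure-Python sketch of parsing compact JSON-lines data. The two sample records are hypothetical, but shaped like the rows above:

```python
import json

# Each line of the raw file is one compact JSON object (one blog post).
lines = [
    '{"authors": ["Patrick Wendell"], "categories": ["Apache Spark"], '
    '"dates": {"createdOn": "2014-12-19", "publishedOn": "2014-12-19", "tz": "UTC"}}',
    '{"authors": null, "categories": ["Company Blog"], '
    '"dates": {"createdOn": "2015-02-13", "publishedOn": "2015-02-13", "tz": "UTC"}}',
]

# One JSON object per line -> one record (row) each.
rows = [json.loads(line) for line in lines]

# Nested fields are reached with ordinary key access.
print(rows[0]["dates"]["publishedOn"])  # 2014-12-19
```

Note how `authors` can be `null` (as in some rows above) without breaking the schema; this is the nested, flexible structure that JSON allows.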
datesDF = databricksBlogDF.select("dates")
display(datesDF)
| {"createdOn":"2014-04-10","publishedOn":"2014-04-10","tz":"UTC"} |
| {"createdOn":"2014-04-10","publishedOn":"2014-04-10","tz":"UTC"} |
| {"createdOn":"2014-04-01","publishedOn":"2014-04-01","tz":"UTC"} |
| {"createdOn":"2014-03-27","publishedOn":"2014-03-27","tz":"UTC"} |
| {"createdOn":"2014-02-04","publishedOn":"2014-02-04","tz":"UTC"} |
| {"createdOn":"2014-01-02","publishedOn":"2014-01-02","tz":"UTC"} |
| {"createdOn":"2014-03-26","publishedOn":"2014-03-26","tz":"UTC"} |
| {"createdOn":"2014-03-21","publishedOn":"2014-03-21","tz":"UTC"} |
| {"createdOn":"2014-03-19","publishedOn":"2014-03-19","tz":"UTC"} |
| {"createdOn":"2014-03-03","publishedOn":"2014-03-03","tz":"UTC"} |
| {"createdOn":"2014-02-13","publishedOn":"2014-02-13","tz":"UTC"} |
| {"createdOn":"2014-02-11","publishedOn":"2014-02-11","tz":"UTC"} |
| {"createdOn":"2014-01-22","publishedOn":"2014-01-22","tz":"UTC"} |
| {"createdOn":"2013-12-20","publishedOn":"2013-12-20","tz":"UTC"} |
| {"createdOn":"2013-12-19","publishedOn":"2013-12-19","tz":"UTC"} |
| {"createdOn":"2013-11-22","publishedOn":"2013-11-22","tz":"UTC"} |
| {"createdOn":"2013-10-29","publishedOn":"2013-10-29","tz":"UTC"} |
| {"createdOn":"2013-10-28","publishedOn":"2013-10-28","tz":"UTC"} |
| {"createdOn":"2013-10-27","publishedOn":"2013-10-27","tz":"UTC"} |
| {"createdOn":"2014-04-11","publishedOn":"2014-04-11","tz":"UTC"} |
| {"createdOn":"2014-04-15","publishedOn":"2014-04-15","tz":"UTC"} |
| {"createdOn":"2014-06-02","publishedOn":"2014-06-02","tz":"UTC"} |
| {"createdOn":"2014-05-23","publishedOn":"2014-05-23","tz":"UTC"} |
| {"createdOn":"2014-05-23","publishedOn":"2014-05-23","tz":"UTC"} |
| {"createdOn":"2014-05-30","publishedOn":"2014-05-30","tz":"UTC"} |
| {"createdOn":"2014-06-02","publishedOn":"2014-06-02","tz":"UTC"} |
| {"createdOn":"2014-06-04","publishedOn":"2014-06-04","tz":"UTC"} |
| {"createdOn":"2014-06-11","publishedOn":"2014-06-11","tz":"UTC"} |
| {"createdOn":"2014-06-12","publishedOn":"2014-06-12","tz":"UTC"} |
| {"createdOn":"2014-06-13","publishedOn":"2014-06-13","tz":"UTC"} |
| {"createdOn":"2014-06-23","publishedOn":"2014-06-23","tz":"UTC"} |
| {"createdOn":"2014-06-24","publishedOn":"2014-06-24","tz":"UTC"} |
| {"createdOn":"2014-06-26","publishedOn":"2014-06-26","tz":"UTC"} |
| {"createdOn":"2014-06-28","publishedOn":"2014-06-28","tz":"UTC"} |
| {"createdOn":"2014-06-30","publishedOn":"2014-06-30","tz":"UTC"} |
| {"createdOn":"2014-06-30","publishedOn":"2014-06-30","tz":"UTC"} |
| {"createdOn":"2014-06-30","publishedOn":"2014-06-30","tz":"UTC"} |
| {"createdOn":"2014-04-29","publishedOn":"2014-04-29","tz":"UTC"} |
| {"createdOn":"2014-05-08","publishedOn":"2014-05-08","tz":"UTC"} |
| {"createdOn":"2014-04-30","publishedOn":"2014-04-30","tz":"UTC"} |
| {"createdOn":"2014-07-01","publishedOn":"2014-07-01","tz":"UTC"} |
| {"createdOn":"2014-07-01","publishedOn":"2014-07-01","tz":"UTC"} |
| {"createdOn":"2014-07-02","publishedOn":"2014-07-02","tz":"UTC"} |
| {"createdOn":"2014-07-14","publishedOn":"2014-07-14","tz":"UTC"} |
| {"createdOn":"2014-07-16","publishedOn":"2014-07-16","tz":"UTC"} |
| {"createdOn":"2014-07-19","publishedOn":"2014-07-19","tz":"UTC"} |
| {"createdOn":"2014-07-23","publishedOn":"2014-07-23","tz":"UTC"} |
| {"createdOn":"2014-07-22","publishedOn":"2014-07-22","tz":"UTC"} |
| {"createdOn":"2014-07-23","publishedOn":"2014-07-23","tz":"UTC"} |
| {"createdOn":"2014-08-08","publishedOn":"2014-08-08","tz":"UTC"} |
| {"createdOn":"2014-08-15","publishedOn":"2014-08-15","tz":"UTC"} |
| {"createdOn":"2014-08-27","publishedOn":"2014-08-27","tz":"UTC"} |
| {"createdOn":"2014-09-12","publishedOn":"2014-09-12","tz":"UTC"} |
| {"createdOn":"2014-09-16","publishedOn":"2014-09-16","tz":"UTC"} |
| {"createdOn":"2014-09-22","publishedOn":"2014-09-22","tz":"UTC"} |
| {"createdOn":"2014-09-15","publishedOn":"2014-09-15","tz":"UTC"} |
| {"createdOn":"2014-09-18","publishedOn":"2014-09-18","tz":"UTC"} |
| {"createdOn":"2014-09-24","publishedOn":"2014-09-24","tz":"UTC"} |
| {"createdOn":"2014-09-19","publishedOn":"2014-09-19","tz":"UTC"} |
| {"createdOn":"2014-09-24","publishedOn":"2014-09-24","tz":"UTC"} |
| {"createdOn":"2014-09-30","publishedOn":"2014-09-30","tz":"UTC"} |
| {"createdOn":"2014-09-25","publishedOn":"2014-09-25","tz":"UTC"} |
| {"createdOn":"2014-10-01","publishedOn":"2014-10-01","tz":"UTC"} |
| {"createdOn":"2014-10-07","publishedOn":"2014-10-07","tz":"UTC"} |
| {"createdOn":"2014-10-09","publishedOn":"2014-10-09","tz":"UTC"} |
| {"createdOn":"2014-10-10","publishedOn":"2014-10-10","tz":"UTC"} |
| {"createdOn":"2014-10-20","publishedOn":"2014-10-20","tz":"UTC"} |
| {"createdOn":"2014-10-15","publishedOn":"2014-10-15","tz":"UTC"} |
| {"createdOn":"2014-10-23","publishedOn":"2014-10-23","tz":"UTC"} |
| {"createdOn":"2014-10-27","publishedOn":"2014-10-27","tz":"UTC"} |
| {"createdOn":"2014-10-31","publishedOn":"2014-10-31","tz":"UTC"} |
| {"createdOn":"2014-11-25","publishedOn":"2014-11-25","tz":"UTC"} |
| {"createdOn":"2014-12-02","publishedOn":"2014-12-02","tz":"UTC"} |
| {"createdOn":"2014-12-09","publishedOn":"2014-12-09","tz":"UTC"} |
| {"createdOn":"2014-11-05","publishedOn":"2014-11-05","tz":"UTC"} |
| {"createdOn":"2014-11-14","publishedOn":"2014-11-14","tz":"UTC"} |
| {"createdOn":"2014-11-15","publishedOn":"2014-11-15","tz":"UTC"} |
| {"createdOn":"2014-11-22","publishedOn":"2014-11-22","tz":"UTC"} |
| {"createdOn":"2014-12-02","publishedOn":"2014-12-02","tz":"UTC"} |
| {"createdOn":"2014-12-04","publishedOn":"2014-12-04","tz":"UTC"} |
| {"createdOn":"2015-01-13","publishedOn":"2015-01-13","tz":"UTC"} |
| {"createdOn":"2015-01-16","publishedOn":"2015-01-16","tz":"UTC"} |
| {"createdOn":"2015-01-20","publishedOn":"2015-01-20","tz":"UTC"} |
| {"createdOn":"2015-02-02","publishedOn":"2015-02-02","tz":"UTC"} |
| {"createdOn":"2015-01-28","publishedOn":"2015-01-28","tz":"UTC"} |
| {"createdOn":"2015-02-05","publishedOn":"2015-02-05","tz":"UTC"} |
| {"createdOn":"2014-12-19","publishedOn":"2014-12-19","tz":"UTC"} |
| {"createdOn":"2014-12-22","publishedOn":"2014-12-22","tz":"UTC"} |
| {"createdOn":"2015-01-07","publishedOn":"2015-01-07","tz":"UTC"} |
| {"createdOn":"2015-01-09","publishedOn":"2015-01-09","tz":"UTC"} |
| {"createdOn":"2015-01-21","publishedOn":"2015-01-21","tz":"UTC"} |
| {"createdOn":"2015-01-15","publishedOn":"2015-01-15","tz":"UTC"} |
| {"createdOn":"2015-01-27","publishedOn":"2015-01-27","tz":"UTC"} |
| {"createdOn":"2015-02-09","publishedOn":"2015-02-09","tz":"UTC"} |
| {"createdOn":"2015-02-13","publishedOn":"2015-02-13","tz":"UTC"} |
| {"createdOn":"2015-02-14","publishedOn":"2015-02-14","tz":"UTC"} |
| {"createdOn":"2015-02-19","publishedOn":"2015-02-19","tz":"UTC"} |
| {"createdOn":"2015-02-17","publishedOn":"2015-02-17","tz":"UTC"} |
| {"createdOn":"2015-02-24","publishedOn":"2015-02-24","tz":"UTC"} |
| {"createdOn":"2015-03-04","publishedOn":"2015-03-04","tz":"UTC"} |
| dates |
|---|
display(databricksBlogDF.select("dates.createdOn", "dates.publishedOn"))
| 2014-04-10 | 2014-04-10 |
| 2014-04-10 | 2014-04-10 |
| 2014-04-01 | 2014-04-01 |
| 2014-03-27 | 2014-03-27 |
| 2014-02-04 | 2014-02-04 |
| 2014-01-02 | 2014-01-02 |
| 2014-03-26 | 2014-03-26 |
| 2014-03-21 | 2014-03-21 |
| 2014-03-19 | 2014-03-19 |
| 2014-03-03 | 2014-03-03 |
| 2014-02-13 | 2014-02-13 |
| 2014-02-11 | 2014-02-11 |
| 2014-01-22 | 2014-01-22 |
| 2013-12-20 | 2013-12-20 |
| 2013-12-19 | 2013-12-19 |
| 2013-11-22 | 2013-11-22 |
| 2013-10-29 | 2013-10-29 |
| 2013-10-28 | 2013-10-28 |
| 2013-10-27 | 2013-10-27 |
| 2014-04-11 | 2014-04-11 |
| 2014-04-15 | 2014-04-15 |
| 2014-06-02 | 2014-06-02 |
| 2014-05-23 | 2014-05-23 |
| 2014-05-23 | 2014-05-23 |
| 2014-05-30 | 2014-05-30 |
| 2014-06-02 | 2014-06-02 |
| 2014-06-04 | 2014-06-04 |
| 2014-06-11 | 2014-06-11 |
| 2014-06-12 | 2014-06-12 |
| 2014-06-13 | 2014-06-13 |
| 2014-06-23 | 2014-06-23 |
| 2014-06-24 | 2014-06-24 |
| 2014-06-26 | 2014-06-26 |
| 2014-06-28 | 2014-06-28 |
| 2014-06-30 | 2014-06-30 |
| 2014-06-30 | 2014-06-30 |
| 2014-06-30 | 2014-06-30 |
| 2014-04-29 | 2014-04-29 |
| 2014-05-08 | 2014-05-08 |
| 2014-04-30 | 2014-04-30 |
| 2014-07-01 | 2014-07-01 |
| 2014-07-01 | 2014-07-01 |
| 2014-07-02 | 2014-07-02 |
| 2014-07-14 | 2014-07-14 |
| 2014-07-16 | 2014-07-16 |
| 2014-07-19 | 2014-07-19 |
| 2014-07-23 | 2014-07-23 |
| 2014-07-22 | 2014-07-22 |
| 2014-07-23 | 2014-07-23 |
| 2014-08-08 | 2014-08-08 |
| 2014-08-15 | 2014-08-15 |
| 2014-08-27 | 2014-08-27 |
| 2014-09-12 | 2014-09-12 |
| 2014-09-16 | 2014-09-16 |
| 2014-09-22 | 2014-09-22 |
| 2014-09-15 | 2014-09-15 |
| 2014-09-18 | 2014-09-18 |
| 2014-09-24 | 2014-09-24 |
| 2014-09-19 | 2014-09-19 |
| 2014-09-24 | 2014-09-24 |
| 2014-09-30 | 2014-09-30 |
| 2014-09-25 | 2014-09-25 |
| 2014-10-01 | 2014-10-01 |
| 2014-10-07 | 2014-10-07 |
| 2014-10-09 | 2014-10-09 |
| 2014-10-10 | 2014-10-10 |
| 2014-10-20 | 2014-10-20 |
| 2014-10-15 | 2014-10-15 |
| 2014-10-23 | 2014-10-23 |
| 2014-10-27 | 2014-10-27 |
| 2014-10-31 | 2014-10-31 |
| 2014-11-25 | 2014-11-25 |
| 2014-12-02 | 2014-12-02 |
| 2014-12-09 | 2014-12-09 |
| 2014-11-05 | 2014-11-05 |
| 2014-11-14 | 2014-11-14 |
| 2014-11-15 | 2014-11-15 |
| 2014-11-22 | 2014-11-22 |
| 2014-12-02 | 2014-12-02 |
| 2014-12-04 | 2014-12-04 |
| 2015-01-13 | 2015-01-13 |
| 2015-01-16 | 2015-01-16 |
| 2015-01-20 | 2015-01-20 |
| 2015-02-02 | 2015-02-02 |
| 2015-01-28 | 2015-01-28 |
| 2015-02-05 | 2015-02-05 |
| 2014-12-19 | 2014-12-19 |
| 2014-12-22 | 2014-12-22 |
| 2015-01-07 | 2015-01-07 |
| 2015-01-09 | 2015-01-09 |
| 2015-01-21 | 2015-01-21 |
| 2015-01-15 | 2015-01-15 |
| 2015-01-27 | 2015-01-27 |
| 2015-02-09 | 2015-02-09 |
| 2015-02-13 | 2015-02-13 |
| 2015-02-14 | 2015-02-14 |
| 2015-02-19 | 2015-02-19 |
| 2015-02-17 | 2015-02-17 |
| 2015-02-24 | 2015-02-24 |
| 2015-03-04 | 2015-03-04 |
| createdOn | publishedOn |
|---|---|
%md Create a DataFrame, `databricksBlog2DF`, that contains the original columns plus a new `publishedOn` column obtained by flattening the `dates` column.
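In Spark this flattening is typically done with `withColumn` and `col("dates.publishedOn")`. As a language-level intuition, flattening just copies a nested value up to the top level while keeping the original columns. A minimal pure-Python sketch with a hypothetical record:

```python
# A record shaped like one row of the table, with a nested "dates" struct.
records = [
    {"title": "Apache Spark 0.9.1 Released",
     "dates": {"createdOn": "2014-04-10", "publishedOn": "2014-04-10", "tz": "UTC"}},
]

# "Flatten": keep every original field and add a top-level publishedOn key
# copied from the nested dates field.
flattened = [{**r, "publishedOn": r["dates"]["publishedOn"]} for r in records]

print(flattened[0]["publishedOn"])  # 2014-04-10
```

The original `dates` column survives alongside the new top-level `publishedOn`, which is exactly the shape `databricksBlog2DF.printSchema()` should show.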
databricksBlog2DF.printSchema()
%md Both `createdOn` and `publishedOn` are stored as strings. Cast those values to SQL timestamps. In this case, use a single `select` method to: 0. Cast `dates.publishedOn` to a `timestamp` data type 0. "Flatten" the `dates.publishedOn` column to just `publishedOn`
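Spark's `to_timestamp` takes a Java-style date pattern such as `yyyy-MM-dd`. For intuition, the same parse in plain Python uses the `%Y-%m-%d` pattern (a minimal sketch, independent of Spark):

```python
from datetime import datetime

# Java/Spark pattern "yyyy-MM-dd" corresponds to Python's "%Y-%m-%d".
ts = datetime.strptime("2014-04-10", "%Y-%m-%d")

# Parsing a date-only string yields a timestamp at midnight,
# matching the T00:00:00 values in the output below.
print(ts.isoformat())  # 2014-04-10T00:00:00
```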
from pyspark.sql.functions import to_timestamp

display(databricksBlogDF.select("title", to_timestamp("dates.publishedOn", "yyyy-MM-dd").alias("publishedOn")))
| MapR Integrates the Complete Apache Spark Stack | 2014-04-10T00:00:00.000+0000 |
| Apache Spark 0.9.1 Released | 2014-04-10T00:00:00.000+0000 |
| Application Spotlight: Alpine Data Labs | 2014-04-01T00:00:00.000+0000 |
| Spark SQL: Manipulating Structured Data Using Apache Spark | 2014-03-27T00:00:00.000+0000 |
| Apache Spark 0.9.0 Released | 2014-02-04T00:00:00.000+0000 |
| Apache Spark In MapReduce (SIMR) | 2014-01-02T00:00:00.000+0000 |
| Sharethrough Uses Apache Spark Streaming to Optimize Bidding in Real Time | 2014-03-26T00:00:00.000+0000 |
| Apache Spark: A Delight for Developers | 2014-03-21T00:00:00.000+0000 |
| Databricks announces "Certified on Apache Spark" Program | 2014-03-19T00:00:00.000+0000 |
| Apache Spark Now a Top-level Apache Project | 2014-03-03T00:00:00.000+0000 |
| AMPLab updates the Big Data Benchmark | 2014-02-13T00:00:00.000+0000 |
| Databricks at the O'Reilly Strata Conference 2014 | 2014-02-11T00:00:00.000+0000 |
| Apache Spark and Hadoop: Working Together | 2014-01-22T00:00:00.000+0000 |
| Apache Spark 0.8.1 Released | 2013-12-20T00:00:00.000+0000 |
| Highlights From Spark Summit 2013 | 2013-12-19T00:00:00.000+0000 |
| Putting Apache Spark to Use: Fast In-Memory Computing for Your Big Data Applications | 2013-11-22T00:00:00.000+0000 |
| Databricks and Cloudera Partner to Support Apache Spark | 2013-10-29T00:00:00.000+0000 |
| The Growing Apache Spark Community | 2013-10-28T00:00:00.000+0000 |
| Databricks and the Apache Spark Platform | 2013-10-27T00:00:00.000+0000 |
| Databricks and MapR | 2014-04-11T00:00:00.000+0000 |
| Making Apache Spark Easier to Use in Java with Java 8 | 2014-04-15T00:00:00.000+0000 |
| Databricks Announces Apache Spark Training Workshops | 2014-06-02T00:00:00.000+0000 |
| Application Spotlight: Atigeo xPatterns | 2014-05-23T00:00:00.000+0000 |
| Pivotal Hadoop Integrates the Full Apache Spark Stack | 2014-05-23T00:00:00.000+0000 |
| Announcing Apache Spark 1.0 | 2014-05-30T00:00:00.000+0000 |
| Exciting Performance Improvements on the Horizon for Spark SQL | 2014-06-02T00:00:00.000+0000 |
| MicroStrategy "Certified on Apache Spark" | 2014-06-04T00:00:00.000+0000 |
| Application Spotlight: Arimo | 2014-06-11T00:00:00.000+0000 |
| Spark Summit 2014 Brings Together Apache Spark Community | 2014-06-12T00:00:00.000+0000 |
| Application Spotlight: Lightbend | 2014-06-13T00:00:00.000+0000 |
| Application Spotlight: Apervi | 2014-06-23T00:00:00.000+0000 |
| Application Spotlight: Qlik | 2014-06-24T00:00:00.000+0000 |
| Databricks Launches "Certified Apache Spark Distribution" Program | 2014-06-26T00:00:00.000+0000 |
| Application Spotlight: Elasticsearch | 2014-06-28T00:00:00.000+0000 |
| Application Spotlight: Pentaho | 2014-06-30T00:00:00.000+0000 |
| Sparkling Water = H20 + Apache Spark | 2014-06-30T00:00:00.000+0000 |
| Databricks Unveils Apache Spark-Based Cloud Platform; Announces Series B Funding | 2014-06-30T00:00:00.000+0000 |
| Databricks Application Spotlight at Spark Summit 2014 | 2014-04-29T00:00:00.000+0000 |
| Databricks and Datastax | 2014-05-08T00:00:00.000+0000 |
| Databricks Partners with Simba to Deliver Shark ODBC Driver | 2014-04-30T00:00:00.000+0000 |
| Databricks Announces Partnership with SAP | 2014-07-01T00:00:00.000+0000 |
| Integrating Apache Spark and HANA | 2014-07-01T00:00:00.000+0000 |
| Shark, Spark SQL, Hive on Spark, and the future of SQL on Apache Spark | 2014-07-02T00:00:00.000+0000 |
| Databricks: Making Big Data Easy | 2014-07-14T00:00:00.000+0000 |
| New Features in MLlib in Apache Spark 1.0 | 2014-07-16T00:00:00.000+0000 |
| The State of Apache Spark in 2014 | 2014-07-19T00:00:00.000+0000 |
| Scalable Collaborative Filtering with Apache Spark MLlib | 2014-07-23T00:00:00.000+0000 |
| Distributing the Singular Value Decomposition with Apache Spark | 2014-07-22T00:00:00.000+0000 |
| Spark Summit 2014 Highlights | 2014-07-23T00:00:00.000+0000 |
| When Stratio Met Apache Spark: A True Love Story | 2014-08-08T00:00:00.000+0000 |
| Mining Ecommerce Graph Data with Apache Spark at Alibaba Taobao | 2014-08-15T00:00:00.000+0000 |
| Statistics Functionality in Apache Spark 1.1 | 2014-08-27T00:00:00.000+0000 |
| Announcing Apache Spark 1.1 | 2014-09-12T00:00:00.000+0000 |
| Apache Spark 1.1: The State of Spark Streaming | 2014-09-16T00:00:00.000+0000 |
| Apache Spark 1.1: MLlib Performance Improvements | 2014-09-22T00:00:00.000+0000 |
| Application Spotlight: Talend | 2014-09-15T00:00:00.000+0000 |
| Apache Spark 1.1: Bringing Hadoop Input/Output Formats to PySpark | 2014-09-18T00:00:00.000+0000 |
| Databricks Reference Applications | 2014-09-24T00:00:00.000+0000 |
| Databricks and O'Reilly Media launch Certification Program for Apache Spark Developers | 2014-09-19T00:00:00.000+0000 |
| Apache Spark Improves the Economics of Video Distribution at NBC Universal | 2014-09-24T00:00:00.000+0000 |
| Scalable Decision Trees in MLlib | 2014-09-30T00:00:00.000+0000 |
| Guavus Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World’s Largest Telcos | 2014-09-25T00:00:00.000+0000 |
| Apache Spark as a platform for large-scale neuroscience | 2014-10-01T00:00:00.000+0000 |
| Sharethrough Uses Apache Spark Streaming to Optimize Advertisers' Return on Marketing Investment | 2014-10-07T00:00:00.000+0000 |
| Application Spotlight: Trifacta | 2014-10-09T00:00:00.000+0000 |
| Apache Spark the fastest open source engine for sorting a petabyte | 2014-10-10T00:00:00.000+0000 |
| Efficient similarity algorithm now in Apache Spark, thanks to Twitter | 2014-10-20T00:00:00.000+0000 |
| Application Spotlight: Tableau Software | 2014-10-15T00:00:00.000+0000 |
| Spark Summit East - CFP now open | 2014-10-23T00:00:00.000+0000 |
| Application Spotlight: Faimdata | 2014-10-27T00:00:00.000+0000 |
| Hortonworks: A shared vision for Apache Spark on Hadoop | 2014-10-31T00:00:00.000+0000 |
| Application Spotlight: Skytree Infinity | 2014-11-25T00:00:00.000+0000 |
| Application Spotlight: Nube Reifier | 2014-12-02T00:00:00.000+0000 |
| Pearson uses Apache Spark Streaming for next generation adaptive learning platform | 2014-12-09T00:00:00.000+0000 |
| Apache Spark officially sets a new record in large-scale sorting | 2014-11-05T00:00:00.000+0000 |
| Application Spotlight: Bedrock | 2014-11-14T00:00:00.000+0000 |
| The Apache Spark Certified Developer Program | 2014-11-15T00:00:00.000+0000 |
| Samsung SDS uses Apache Spark for prescriptive analytics at large scale | 2014-11-22T00:00:00.000+0000 |
| Databricks to run two massive online courses on Apache Spark | 2014-12-02T00:00:00.000+0000 |
| Application Spotlight: Technicolor Virdata Internet of Things platform | 2014-12-04T00:00:00.000+0000 |
| Databricks Expands Bay Area Presence, Moves HQ to San Francisco | 2015-01-13T00:00:00.000+0000 |
| Apache Spark Certified Developer exams available online! | 2015-01-16T00:00:00.000+0000 |
| Spark Summit East 2015 Agenda is Now Available | 2015-01-20T00:00:00.000+0000 |
| An introduction to JSON support in Spark SQL | 2015-02-02T00:00:00.000+0000 |
| Introducing streaming k-means in Apache Spark 1.2 | 2015-01-28T00:00:00.000+0000 |
| Apache Spark selected for Infoworld 2015 Technology of the Year Award | 2015-02-05T00:00:00.000+0000 |
| Announcing Apache Spark 1.2 | 2014-12-19T00:00:00.000+0000 |
| Announcing Apache Spark Packages | 2014-12-22T00:00:00.000+0000 |
| ML Pipelines: A New High-Level API for MLlib | 2015-01-07T00:00:00.000+0000 |
| Spark SQL Data Sources API: Unified Data Access for the Apache Spark Platform | 2015-01-09T00:00:00.000+0000 |
| Random Forests and Boosting in MLlib | 2015-01-21T00:00:00.000+0000 |
| Improved Fault-tolerance and Zero Data Loss in Apache Spark Streaming | 2015-01-15T00:00:00.000+0000 |
| Big data projects are hungry for simpler and more powerful tools: Survey validates Apache Spark is gaining developer traction! | 2015-01-27T00:00:00.000+0000 |
| "Learning Spark" book available from O'Reilly | 2015-02-09T00:00:00.000+0000 |
| Automatic Labs Selects Databricks for Primary Real-Time Data Processing | 2015-02-13T00:00:00.000+0000 |
| Apache Spark: A review of 2014 and looking ahead to 2015 priorities | 2015-02-14T00:00:00.000+0000 |
| Extending MemSQL Analytics with Apache Spark | 2015-02-19T00:00:00.000+0000 |
| Introducing DataFrames in Apache Spark for Large Scale Data Science | 2015-02-17T00:00:00.000+0000 |
| Databricks at Strata San Jose | 2015-02-24T00:00:00.000+0000 |
| Databricks: From raw data, to insights and data products in an instant! | 2015-03-04T00:00:00.000+0000 |
%md Create another DataFrame, `databricksBlog2DF`, that contains the original columns plus the new `publishedOn` column obtained by flattening the `dates` column.
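Conceptually, `withColumn` here promotes a nested field to a top-level column while keeping every original column. A Spark-free sketch of that flattening on a single JSON-like record (plain Python dicts; `flatten_published_on` is a hypothetical helper, not part of any API):

```python
from datetime import datetime

def flatten_published_on(row):
    """Return a copy of the row with dates.publishedOn promoted to a
    top-level 'publishedOn' timestamp; all original fields are preserved."""
    flattened = dict(row)  # keep the original columns
    flattened["publishedOn"] = datetime.strptime(
        row["dates"]["publishedOn"], "%Y-%m-%d"
    )
    return flattened

# One compact-JSON row from the blog table, reduced to two fields.
record = {
    "title": "Databricks and MapR",
    "dates": {"createdOn": "2014-04-11", "publishedOn": "2014-04-11", "tz": "UTC"},
}
flat = flatten_published_on(record)
print(flat["publishedOn"])  # 2014-04-11 00:00:00
```

In Spark, `withColumn("publishedOn", ...)` applies this same transformation to every row of the DataFrame at once, leaving the original `dates` struct in place alongside the new column.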
databricksBlog2DF = databricksBlogDF.withColumn("publishedOn", to_timestamp("dates.publishedOn", "yyyy-MM-dd"))

display(databricksBlog2DF)
| ["Tomer Shiran (VP of Product Management at MapR)"] | ["Company Blog","Partners"] | <div class="post-meta">This post is guest authored by our friends at MapR, announcing our new partnership to provide enterprise support for Apache Spark as part of MapR's Distribution of Hadoop.</div> <hr /> With over 500 paying customers, my team and I have the opportunity to talk to many organizations that are leveraging Hadoop in production to extract value from big data. One of the most common topics raised by our customers in recent months is Apache Spark. Some customers just want to learn more about the advantages of this technology and the use cases that it addresses, while others are already running it in production with the MapR Distribution. These customers range from the world’s largest cable telcos and retailers to Silicon Valley startups such as Quantifind, which recently talked about its use of Spark on MapR in an <a href="http://www.datameer.com/ceoblog/big-data-brews-with-erich-nachbar/" target="_blank">interview</a> with Stefan Groschupf, CEO of Datameer. Today, I a... | roy | {"createdOn":"2014-04-10","publishedOn":"2014-04-10","tz":"UTC"} | null | 33 | https://databricks.com/blog/2014/04/10/mapr-integrates-spark-stack.html | mapr-integrates-spark-stack | publish | MapR Integrates the Complete Apache Spark Stack | 2014-04-10T00:00:00.000+0000 |
| ["Tathagata Das"] | ["Apache Spark","Engineering Blog","Machine Learning"] | We are happy to announce the availability of <a href="http://spark.apache.org/releases/spark-release-0-9-1.html" target="_blank">Apache Spark 0.9.1</a>! This is a maintenance release with bug fixes, performance improvements, better stability with YARN and improved parity of the Scala and Python API. We recommend all 0.9.0 users to upgrade to this stable release. This is the first release since Spark graduated as a top level Apache project. Contributions to this release came from 37 developers. Visit the <a href="http://spark.apache.org/releases/spark-release-0-9-1.html" target="_blank">release notes</a> for more information about all the improvements and bug fixes. <a href="http://spark.apache.org/downloads.html" target="_blank">Download</a> it and try it out! | tdas | {"createdOn":"2014-04-10","publishedOn":"2014-04-10","tz":"UTC"} | null | 35 | https://databricks.com/blog/2014/04/09/spark-0_9_1-released.html | spark-0_9_1-released | publish | Apache Spark 0.9.1 Released | 2014-04-10T00:00:00.000+0000 |
| ["Steven Hillion"] | ["Company Blog","Partners"] | <div class="post-meta">This post is guest authored by our friends at Alpine Data Labs, part of the 'Application Spotlight' series highlighting innovative applications that are part of the Databricks "Certified on Apache Spark" program.</div> <hr /> Everyone knows how hard it is to recruit engineers and data scientists in Silicon Valley. At <a href="http://www.alpinenow.com" target="_blank">Alpine Data Labs</a>, we think what we’re up to is pretty fun and challenging, but we still have to compete with other start-ups as well as the big internet companies to attract the best talent. One thing that can help is to be able to say that you’re working with the most innovative and powerful technologies. Last year, I was interviewing a talented engineer with a strong background in machine learning. And he said that the one thing he wanted to do above all was to work with Apache Spark. “Will I get to do that at Alpine?” he asked. If it had been even a year earlier, I would have said “Sure…at... | roy | {"createdOn":"2014-04-01","publishedOn":"2014-04-01","tz":"UTC"} | null | 37 | https://databricks.com/blog/2014/03/31/application-spotlight-alpine.html | application-spotlight-alpine | publish | Application Spotlight: Alpine Data Labs | 2014-04-01T00:00:00.000+0000 |
| ["Michael Armbrust","Reynold Xin"] | ["Apache Spark","Engineering Blog"] | Building a unified platform for big data analytics has long been the vision of Apache Spark, allowing a single program to perform ETL, MapReduce, and complex analytics. An important aspect of unification that our users have consistently requested is the ability to more easily import data stored in external sources, such as Apache Hive. Today, we are excited to announce <a href="https://spark.apache.org/docs/latest/sql-programming-guide.html">Spark SQL</a>, a new component recently merged into the Spark repository. Spark SQL brings native support for SQL to Spark and streamlines the process of querying data stored both in RDDs (Spark’s distributed datasets) and in external sources. Spark SQL conveniently blurs the lines between RDDs and relational tables. Unifying these powerful abstractions makes it easy for developers to intermix SQL commands querying external data with complex analytics, all within in a single application. Concretely, Spark SQL will allow developers to: <ul> <li>I... | michael | {"createdOn":"2014-03-27","publishedOn":"2014-03-27","tz":"UTC"} | null | 42 | https://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html | spark-sql-manipulating-structured-data-using-spark-2 | publish | Spark SQL: Manipulating Structured Data Using Apache Spark | 2014-03-27T00:00:00.000+0000 |
| ["Patrick Wendell"] | ["Apache Spark","Engineering Blog"] | Our goal with Apache Spark is very simple: provide the best platform for computation on big data. We do this through both a powerful core engine and rich libraries for useful analytics tasks. Today, we are excited to announce the release of Apache Spark 0.9.0. This major release extends Spark’s libraries and further improves its performance and usability. Apache Spark 0.9.0 is the largest release to date, with work from 83 contributors, who submitted over 300 patches. Apache Spark 0.9 features significant extensions to the set of standard analytical libraries packaged with Spark. The release introduces GraphX, a library for graph computation that comes with implementations of several standard algorithms, such as PageRank. Spark’s machine learning library (MLlib) has been extended to support Python, using the NumPy numerical library. A Naive Bayes Classifier has also been added to MLlib. Finally, Spark Streaming, which supports near-real-time continuous computation, has added a simplif... | patrick | {"createdOn":"2014-02-04","publishedOn":"2014-02-04","tz":"UTC"} | null | 58 | https://databricks.com/blog/2014/02/03/release-0_9_0.html | release-0_9_0 | publish | Apache Spark 0.9.0 Released | 2014-02-04T00:00:00.000+0000 |
| ["Ali Ghodsi","Ahir Reddy"] | ["Apache Spark","Ecosystem","Engineering Blog"] | Apache Hadoop integration has always been a key goal of Apache Spark and <a href="http://hortonworks.com/wp-content/uploads/2013/06/YARN.png">YARN</a> users have long been able to run <a href="http://spark.incubator.apache.org/docs/latest/running-on-yarn.html">Spark on YARN</a>. However, up to now, it has been relatively hard to run Spark on Hadoop MapReduce v1 clusters, i.e. clusters that do not have YARN installed. Typically, users would have to get permission to install Spark/Scala on some subset of the machines, a process that could be time consuming. Enter <a href="http://databricks.github.io/simr/">SIMR (Spark In MapReduce)</a>, which has been released in conjunction with <a href="https://databricks.com/blog/2013/12/19/release-0_8_1.html">Apache Spark 0.8.1</a>. SIMR allows anyone with access to a Hadoop MapReduce v1 cluster to run Spark out of the box. A user can run Spark directly on top of Hadoop MapReduce v1 without any administrative rights, and without having Spark or Scal... | ali | {"createdOn":"2014-01-02","publishedOn":"2014-01-02","tz":"UTC"} | null | 65 | https://databricks.com/blog/2014/01/01/simr.html | simr | publish | Apache Spark In MapReduce (SIMR) | 2014-01-02T00:00:00.000+0000 |
| ["Russell Cardullo (Data Infrastructure Engineer at Sharethrough)","Michael Ruggiero (Data Infrastructure Engineer at Sharethrough)"] | ["Company Blog","Customers"] | <div class="post-meta">We're very happy to see our friends at Cloudera continue to get the word out about Apache Spark, and their latest blog post is a great example of how users are putting Spark Streaming to use to solve complex problems in real time. Thanks to Russell Cardullo and Michael Ruggiero, Data Infrastructure Engineers at <a href="http://engineering.sharethrough.com/">Sharethrough</a>, for this <a href="http://blog.cloudera.com/blog/2014/03/letting-it-flow-with-spark-streaming/">guest post on Cloudera's blog</a>, which we've cross-posted below</div> <hr /> At Sharethrough, which offers an advertising exchange for delivering in-feed ads, we’ve been running on CDH for the past three years (after migrating from Amazon EMR), primarily for ETL. With the launch of our exchange platform in early 2013 and our desire to optimize content distribution in real time, our needs changed, yet CDH remains an important part of our infrastructure. In mid-2013, we began to examine stream-ba... | roy | {"createdOn":"2014-03-26","publishedOn":"2014-03-26","tz":"UTC"} | null | 2409 | https://databricks.com/blog/2014/03/25/sharethrough-and-spark-streaming.html | sharethrough-and-spark-streaming | publish | Sharethrough Uses Apache Spark Streaming to Optimize Bidding in Real Time | 2014-03-26T00:00:00.000+0000 |
| ["Jai Ranganathan","Matei Zaharia"] | ["Apache Spark","Engineering Blog"] | <div class="post-meta"> This article was cross-posted in the <a href="http://blog.cloudera.com/blog/2014/03/apache-spark-a-delight-for-developers/">Cloudera developer blog</a>. </div> <a href="http://spark.apache.org/">Apache Spark</a> is well known today for its <a href="http://blog.cloudera.com/blog/2013/11/putting-spark-to-use-fast-in-memory-computing-for-your-big-data-applications/">performance benefits</a> over MapReduce, as well as its <a href="http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/">versatility</a>. However, another important benefit — the elegance of the development experience — gets less mainstream attention. In this post, you’ll learn just a few of the features in Spark that make development purely a pleasure. <h2>Language Flexibility</h2> Spark natively provides support for a variety of popular development languages. Out of the box, it supports Scala, Java, and Python, with some promising work ongoing <a href="http:/... | matei | {"createdOn":"2014-03-21","publishedOn":"2014-03-21","tz":"UTC"} | null | 2410 | https://databricks.com/blog/2014/03/20/apache-spark-a-delight-for-developers.html | apache-spark-a-delight-for-developers | publish | Apache Spark: A Delight for Developers | 2014-03-21T00:00:00.000+0000 |
| ["Databricks Press Office"] | ["Announcements","Company Blog"] | <strong>BERKELEY, Calif. – March 18, 2014 –</strong> Databricks, the company founded by the creators of Apache Spark that is revolutionizing what enterprises can do with Big Data, today announced the Databricks <a href="/certification/">“Certified on Spark” Program</a> for applications built on top of the Apache Spark platform. This program ensures that certified applications will work with a multitude of commercially supported Spark distributions. “Pioneering application developers that are leveraging the power of Spark have had to choose between two sub-optimal choices: they either have to package Spark platform support with their application or attempt to maintain integration/certification individually with a rapidly increasing set of commercially supported Spark distributions,” said Ion Stoica, Databricks CEO. “The Databricks ‘Certified on Spark’ program enables developers to certify solely against the 100% open-source Apache Spark distribution, and ensures interoperability with A... | roy | {"createdOn":"2014-03-19","publishedOn":"2014-03-19","tz":"UTC"} | null | 2411 | https://databricks.com/blog/2014/03/18/spark-certification.html | spark-certification | publish | Databricks announces "Certified on Apache Spark" Program | 2014-03-19T00:00:00.000+0000 |
| ["Ion Stoica"] | ["Apache Spark","Engineering Blog"] | <div class="blogContent"> We are delighted with the recent <a href="https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces50">announcement</a> of the Apache Software Foundation that <a href="http://spark.apache.org">Apache Spark</a> has become a top-level Apache project. This is a recognition of the fantastic work done by the Spark open source community, which now counts over 140 developers from 30+ companies. In short time, Spark has become an increasingly popular solution for numerous big data applications, including machine learning, interactive queries, and stream processing. Spark now is an integral part of the Hadoop ecosystem, with many organizations employing Spark to perform sophisticated processing on their Hadoop data. At Databricks we are looking forward to continuing our work with the open source community to accelerate the development and adoption of Apache Spark. Currently employing the lead developers and creators of many of the components... | ion | {"createdOn":"2014-03-03","publishedOn":"2014-03-03","tz":"UTC"} | null | 2412 | https://databricks.com/blog/2014/03/02/spark-apache-top-level-project.html | spark-apache-top-level-project | publish | Apache Spark Now a Top-level Apache Project | 2014-03-03T00:00:00.000+0000 |
| ["Ahir Reddy","Reynold Xin"] | ["Apache Spark","Engineering Blog"] | The AMPLab at UC Berkeley, with help from Databricks, recently released an update to the <a href="https://amplab.cs.berkeley.edu/benchmark/">Big Data Benchmark</a>. This benchmark uses Amazon EC2 to compare performance of five popular SQL query engines in the Big Data ecosystem on common types of queries, which can be reproduced through publicly available scripts and datasets. In the past year, the community has invested heavily in performance optimizations of query engines. We are glad to see that all projects have evolved in this area. Although the queries used in the benchmark are simple, we are proud that Shark remains one of the fastest engines for these workloads, and has improved significantly since the last run. While this benchmark reaffirms Shark as a highly performant SQL query engine, we are working hard at Databricks to push the boundaries further. Stay tuned for some exciting news we will share soon with the community. <ul> <li><a href="https://amplab.cs.berkeley.edu/b... | rxin | {"createdOn":"2014-02-13","publishedOn":"2014-02-13","tz":"UTC"} | null | 2413 | https://databricks.com/blog/2014/02/12/big-data-benchmark.html | big-data-benchmark | publish | AMPLab updates the Big Data Benchmark | 2014-02-13T00:00:00.000+0000 |
| ["Pat McDonough"] | ["Company Blog","Events"] | The Databricks team is excited to take part in a number of activities throughout the 2014 O’Reilly Strata Conference in Santa Clara. From hands-on training, to office hours, to several talks (including a keynote), there are plenty of chances for attendees to learn how Apache Spark is bringing ease of use and outstanding performance to your big data. The schedule for the Databricks team includes: <ul> <li><a href="http://ampcamp.berkeley.edu/4/">AMPCamp4</a>, Hosted at Strata</li> <li><a href="http://strataconf.com/strata2014/public/content/office-hours">Office Hours</a> on Wednesday at 5:45pm</li> <li><a href="http://strataconf.com/strata2014/public/schedule/detail/33057">How Companies are Using Spark, and Where the Edge in Big Data Will Be</a>, a keynote talk presented by Matei Zaharia on Thursday at 9:15am</li> <li><a href="http://strataconf.com/strata2014/public/schedule/detail/32375">Querying Petabytes of Data in Seconds with BlinkDB</a>, co-presented by Reynold Xin on Thur... | pat.mcdonough | {"createdOn":"2014-02-11","publishedOn":"2014-02-11","tz":"UTC"} | null | 2414 | https://databricks.com/blog/2014/02/10/strata-santa-clara-2014.html | strata-santa-clara-2014 | publish | Databricks at the O'Reilly Strata Conference 2014 | 2014-02-11T00:00:00.000+0000 |
| ["Ion Stoica"] | ["Apache Spark","Ecosystem","Engineering Blog"] | We are often asked how does <a href="http://spark.incubator.apache.org">Apache Spark</a> fits in the Hadoop ecosystem, and how one can run Spark in a existing Hadoop cluster. This blog aims to answer these questions. First, Spark is intended to <em>enhance</em>, not replace, the Hadoop stack. From day one, Spark was designed to read and write data from and to HDFS, as well as other storage systems, such as HBase and Amazon’s S3. As such, Hadoop users can enrich their processing capabilities by combining Spark with Hadoop MapReduce, HBase, and other big data frameworks. Second, we have constantly focused on making it as easy as possible for <em>every Hadoop user</em> to take advantage of Spark’s capabilities. No matter whether you run Hadoop 1.x or Hadoop 2.0 (YARN), and no matter whether you have administrative privileges to configure the Hadoop cluster or not, there is a way for you to run Spark! In particular, there are three ways to deploy Spark in a Hadoop cluster: standalone, YA... | ion | {"createdOn":"2014-01-22","publishedOn":"2014-01-22","tz":"UTC"} | null | 2415 | https://databricks.com/blog/2014/01/21/spark-and-hadoop.html | spark-and-hadoop | publish | Apache Spark and Hadoop: Working Together | 2014-01-22T00:00:00.000+0000 |
| ["Patrick Wendell"] | ["Apache Spark","Engineering Blog"] | We are happy to announce the release of Apache Spark 0.8.1. In addition to performance and stability improvements, this release adds three new features. First, Spark now supports for the newest versions of YARN (2.2+). Second, the standalone cluster manager supports a high-availability mode in which it can tolerate master failures. Third, shuffles have been optimized to create fewer files, improving shuffle performance drastically in some settings. In conjunction with the Apache Spark 0.8.1 release we are separately releasing <a href="https://databricks.com/blog/2014/01/01/simr.html">Spark In MapReduce (SIMR)</a>, which enables seamlessly running Spark on Hadoop MapReduce v1 clusters without requiring the installation of Scala or Spark. While Apache Spark 0.8.1 is a minor release, it includes these larger features for the benefit of Scala 2.9 users. The next major release of Apache Spark, 0.9.0, will be based on Scala 2.10. This release was a community effort, featuring contribution... | patrick | {"createdOn":"2013-12-20","publishedOn":"2013-12-20","tz":"UTC"} | null | 2416 | https://databricks.com/blog/2013/12/19/release-0_8_1.html | release-0_8_1 | publish | Apache Spark 0.8.1 Released | 2013-12-20T00:00:00.000+0000 |
| ["Andy Konwinski"] | ["Company Blog","Customers","Events"] | Earlier this month we held the <a href="http://spark-summit.org/2013">first Spark Summit</a>, a conference to bring the Apache Spark community together. We are excited to share some statistics and highlights from the event. <ul> <li>450 participants from over 180 companies attended</li> <li>Participants came from 13 countries</li> <li>Spark training was sold out at 200 participants from 80 companies</li> <li>20 organizations sponsored the event, including all major Hadoop platform vendors</li> <li>20 different organizations gave talks</li> </ul> Videos and slides for all talks are now available on the <a href="http://spark-summit.org/2013">Summit 2013 page</a>. The Summit included Keynotes from Databricks, the UC Berkeley AMPLab, and Yahoo, as well as presentations from 18 other companies including Amazon, Red Hat, and Adobe. Talk topics covered a wide range including specialized applications such as mapping and manipulating the brain, product launches, and research projects... | andy | {"createdOn":"2013-12-19","publishedOn":"2013-12-19","tz":"UTC"} | null | 2417 | https://databricks.com/blog/2013/12/18/spark-summit-2013-follow-up.html | spark-summit-2013-follow-up | publish | Highlights From Spark Summit 2013 | 2013-12-19T00:00:00.000+0000 |
| ["Pat McDonough"] | ["Apache Spark","Engineering Blog"] | [sidenote]A version of this post appears on the <a href="http://blog.cloudera.com/blog/2013/11/putting-spark-to-use-fast-in-memory-computing-for-your-big-data-applications/">Cloudera Blog</a>.[/sidenote] <hr/> Apache Hadoop has revolutionized big data processing, enabling users to store and process huge amounts of data at very low costs. MapReduce has proven to be an ideal platform to implement complex batch applications as diverse as sifting through system logs, running ETL, computing web indexes, and powering personal recommendation systems. However, its reliance on persistent storage to provide fault tolerance and its one-pass computation model make MapReduce a poor fit for low-latency applications and iterative computations, such as machine learning and graph algorithms. Apache Spark addresses these limitations by generalizing the MapReduce computation model, while dramatically improving performance and ease of use. <h2 id="fast-and-easy-big-data-processing-with-spark">Fast and ... | pat.mcdonough | {"createdOn":"2013-11-22","publishedOn":"2013-11-22","tz":"UTC"} | null | 2418 | https://databricks.com/blog/2013/11/21/putting-spark-to-use.html | putting-spark-to-use | publish | Putting Apache Spark to Use: Fast In-Memory Computing for Your Big Data Applications | 2013-11-22T00:00:00.000+0000 |
| ["Ion Stoica"] | ["Company Blog","Partners"] | Today, Cloudera announced that it will distribute and support Apache Spark. We are very excited about this announcement, and what it brings to the Spark platform and the open source community. So what does this announcement mean for Spark? First, it validates the maturity of the Spark platform. Started as a research project at UC Berkeley in 2009, Spark is the first general purpose cluster computing engine that can run sophisticated computations at memory speeds on Hadoop clusters. Spark started with the goal of providing efficient support for iterative algorithms (such as machine learning) and interactive queries, workloads not well supported by MapReduce. Since then, Spark has grown to support other applications such as streaming, and has gained rapid industry adoption. Today, Spark is used in production by numerous companies, and it counts on an ever growing open source community with over 90 contributors from 25 companies. Second, it will make the Spark platform available to a wi... | ion | {"createdOn":"2013-10-29","publishedOn":"2013-10-29","tz":"UTC"} | null | 2419 | https://databricks.com/blog/2013/10/28/databricks-and-cloudera-partner-to-support-spark.html | databricks-and-cloudera-partner-to-support-spark | publish | Databricks and Cloudera Partner to Support Apache Spark | 2013-10-29T00:00:00.000+0000 |
| ["Matei Zaharia"] | ["Announcements","Company Blog"] | This year has seen unprecedented growth in both the user and contributor communities around <a href="http://spark.incubator.apache.org">Apache Spark</a>. This rapid growth validates the tremendous potential of the platform, and shows the great excitement around it. While Spark started as a research project by a few grad students at UC Berkeley in 2009, today <strong>over 90 developers from 25 companies have contributed to Spark</strong>. This is not counting contributors to Shark (Hive on Spark), of which there are 25. Indeed, out of the many new big data engines created in the past few years, <strong>Spark has the largest development community after Hadoop MapReduce</strong>. We believe that new components in the project, like <a href="http://spark.incubator.apache.org/docs/latest/streaming-programming-guide.html">Spark Streaming</a> and <a href="http://spark.incubator.apache.org/docs/latest/mllib-guide.html">MLlib</a>, will only increase this growth. <h2>Growth by Numbers</h2> To gi... | matei | {"createdOn":"2013-10-28","publishedOn":"2013-10-28","tz":"UTC"} | null | 2420 | https://databricks.com/blog/2013/10/27/the-growing-spark-community.html | the-growing-spark-community | publish | The Growing Apache Spark Community | 2013-10-28T00:00:00.000+0000 |
| ["Ion Stoica","Matei Zaharia"] | ["Announcements","Company Blog"] | When we announced that the original team behind <a href="http://spark.incubator.apache.org">Apache Spark</a> is starting a company around the project, we got a lot of excited questions. What areas will the company focus on, and what will it mean for the open source project? Today, in our first blog post at Databricks, we’re happy to share some of our goals, and say a little about what we’re doing next with Spark. To start with, our mission at Databricks is simple: we want to build the very best computing platform for extracting value from data. Big data is a tremendous opportunity that is still largely untapped, and we’ve been working for the past six years to transform what can be done with it. Going forward, we are fully committed to building out the open source Apache Spark platform to achieve this goal. <h2 id="how-we-think-about-big-data-speed-and-sophistication">How We Think about Big Data: Speed and Sophistication</h2> In the past few years, open source technologies like Hadoop... | ion | {"createdOn":"2013-10-27","publishedOn":"2013-10-27","tz":"UTC"} | null | 2421 | https://databricks.com/blog/2013/10/27/databricks-and-the-apache-spark-platform.html | databricks-and-the-apache-spark-platform | publish | Databricks and the Apache Spark Platform | 2013-10-27T00:00:00.000+0000 |
| ["Arsalan Tavakoli-Shiraji"] | ["Company Blog","Partners"] | Today, MapR announced that it will distribute and support the Apache Spark platform as part of the MapR Distribution for Hadoop in partnership with Databricks. We’re thrilled to start on this journey with MapR for a multitude of reasons. One of our primary goals at Databricks is to drive broad adoption of Spark and ensure everybody who uses it has a fantastic experience. This partnership will enable all of MapR’s enterprise customers, existing and new, to leverage Spark with the backing of the same great enterprise support available for the rest of MapR’s Hadoop Distribution. As Tomer mentioned in his <a href="/blog/2014/04/10/MapR-Integrates-Spark-Stack.html">blog post</a>, Spark is one of the most common topics in discussions with MapR’s existing customers and many are even already running it in production! A core part of Spark’s value proposition is the ability to easily build a unified end-to-end workflow where critical functions are first class citizens that are seamlessly integ... | arsalan | {"createdOn":"2014-04-11","publishedOn":"2014-04-11","tz":"UTC"} | null | 2461 | https://databricks.com/blog/2014/04/10/partnership-between-databricks-and-mapr.html | partnership-between-databricks-and-mapr | publish | Databricks and MapR | 2014-04-11T00:00:00.000+0000 |
| ["Prashant Sharma","Matei Zaharia"] | ["Apache Spark","Engineering Blog"] | One of Apache Spark’s main goals is to make big data applications easier to write. Spark has always had concise APIs in Scala and Python, but its Java API was verbose due to the lack of function expressions. With the addition of <a href="http://docs.oracle.com/javase/tutorial/java/javaOO/lambdaexpressions.html">lambda expressions</a> in Java 8, we’ve updated Spark’s API to transparently support these expressions, while staying compatible with old versions of Java. This new support will be available in Apache Spark 1.0. <h2 id="a-few-examples">A Few Examples</h2> The following examples show how Java 8 makes code more concise. In our first example, we search a log file for lines that contain “error”, using Spark’s <code>filter</code> and <code>count</code> operations. The code is simple to write, but passing a Function object to <code>filter</code> is clunky: <h5 id="java-7-search-example">Java 7 search example:</h5> <pre>JavaRDD<String> lines = sc.textFile("hdfs://log.txt").filter( n... | matei | {"createdOn":"2014-04-15","publishedOn":"2014-04-15","tz":"UTC"} | null | 12 | https://databricks.com/blog/2014/04/14/spark-with-java-8.html | spark-with-java-8 | publish | Making Apache Spark Easier to Use in Java with Java 8 | 2014-04-15T00:00:00.000+0000 |
| ["Databricks Training Team"] | ["Announcements","Company Blog","Events"] | Databricks is excited to launch its training program, starting with <a title="Spark Training" href="https://databricks.com/training">a series of hands-on Apache Spark workshops</a> designed by the creators of Apache Spark. The first workshop, <em>Introduction to Apache Spark</em>, establishes the fundamentals of using Spark for data exploration, analysis, and building big data applications. This one day workshop is hands-on, covering topics such as: interactively working with Spark's core APIs, learning the key concepts of big data, deploying applications on common Hadoop distributions, and unifying data pipelines with SQL, Streaming, and Machine Learning. Workshops are currently scheduled in New York, San Jose, Austin, and Chicago, with workshops in more cities to come. Visit <a title="Databricks Training" href="https://databricks.com/training">Databricks' training page</a> to find more information and please leave feedback there if you'd like to see a workshop in your area. <ul cla... | pat.mcdonough | {"createdOn":"2014-06-02","publishedOn":"2014-06-02","tz":"UTC"} | null | 273 | https://databricks.com/blog/2014/06/02/databricks-hands-on-technical-workshops.html | databricks-hands-on-technical-workshops | publish | Databricks Announces Apache Spark Training Workshops | 2014-06-02T00:00:00.000+0000 |
| ["Claudiu Barbura (Sr. Dir. of Engineering at Atigeo LLC)"] | ["Company Blog","Partners"] | <div class="post-meta">This post is guest authored by our friends at <a href="http://atigeo.com">Atigeo</a> announcing the certification of their xPatterns offering.</div> <hr /> Here at <a href="http://atigeo.com/">Atigeo</a>, we are always looking for ways to build on, improve, and expand our big data analytics platform, Atigeo xPatterns. More than that, both our development and product management team are focused on big data and on knowing what is right for our customers: data scientists and application developers at companies who are seeking to make the best possible use of their data assets. So we all stay on the lookout for the most useful, advanced, and best-performing set of technologies available. Apache Spark, for us, was a standout: We could see that making a dramatic performance improvement available to our customers and users would mean that xPattern’s analytics, modeling, and machine learning would be more responsive, and that Spark in xPatterns would give our customer... | arsalan | {"createdOn":"2014-05-23","publishedOn":"2014-05-23","tz":"UTC"} | null | 274 | https://databricks.com/blog/2014/05/22/application-spotlight-atigeo-xpatterns.html | application-spotlight-atigeo-xpatterns | publish | Application Spotlight: Atigeo xPatterns | 2014-05-23T00:00:00.000+0000 |
| ["Sarabjeet Chugh (Head of Hadoop Product Management at Pivotal Inc.)"] | ["Company Blog","Partners"] | <div class="post-meta">This post is guest authored by our friends at <a href="http://www.gopivotal.com" target="_blank">Pivotal</a> describing why they’re excited to deliver Apache Spark on their world class Pivotal HD big data analytics platform suite.</div> <hr /> Today, we are excited to announce the immediate availability of the full Apache Spark stack on Pivotal HD. We have been impressed with the rapid adoption of Spark as a replacement for Hadoop’s more traditional processing engines as well as its vibrant ecosystem, and are thrilled to make it possible for Pivotal customers to run Apache Spark on Pivotal HD Hadoop. Just as important is how we’re doing it: Pivotal HD will be part of Databricks’ upcoming certification program – meaning a commitment to provide compatibility with Apache Spark and support the growing ecosystem of Spark applications. <h2>PivotalHD and Spark</h2> Unlike a multi-vendor patchwork of heterogeneous solutions, Pivotal brings together an integrated ful... | arsalan | {"createdOn":"2014-05-23","publishedOn":"2014-05-23","tz":"UTC"} | null | 297 | https://databricks.com/blog/2014/05/23/pivotal-hadoop-integrates-the-full-apache-spark-stack.html | pivotal-hadoop-integrates-the-full-apache-spark-stack | publish | Pivotal Hadoop Integrates the Full Apache Spark Stack | 2014-05-23T00:00:00.000+0000 |
| ["Patrick Wendell"] | ["Apache Spark","Engineering Blog"] | Today, we’re very proud to announce the release of <a title="Spark 1.0.0 Release Notes" href="http://spark.apache.org/releases/spark-release-1-0-0.html">Apache Spark 1.0</a>. Apache Spark 1.0 is a major milestone for the Spark project that brings both numerous new features and strong API compatibility guarantees. The release is also a huge milestone for the Spark developer community: with more than 110 contributors over the past 4 months, it is Spark’s largest release yet, continuing a trend that has quickly made Spark the most active project in the Hadoop ecosystem. <h2>New Features</h2> What features are we most excited about in Apache Spark 1.0? While there are dozens of new features in the release, we’d like to highlight three. <b>Spark SQL</b> The biggest single addition to Apache Spark 1.0 is Spark SQL, a new module that <a title="Spark SQL" href="https://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html">we’ve previously blogged about</a>... | patrick | {"createdOn":"2014-05-30","publishedOn":"2014-05-30","tz":"UTC"} | null | 502 | https://databricks.com/blog/2014/05/30/announcing-spark-1-0.html | announcing-spark-1-0 | publish | Announcing Apache Spark 1.0 | 2014-05-30T00:00:00.000+0000 |
| ["Michael Armbrust","Zongheng Yang"] | ["Apache Spark","Engineering Blog"] | With <a title="Announcing Spark 1.0" href="https://databricks.com/blog/2014/05/30/announcing-spark-1-0.html">Apache Spark 1.0</a> out the door, we’d like to give a preview of the next major initiatives in the Spark project. Today, the most active component of Spark is <a title="Spark SQL: Manipulating Structured Data Using Spark" href="https://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html">Spark SQL</a> - a tightly integrated relational engine that inter-operates with the core Spark API. Spark SQL was released in Spark 1.0, and will provide a lighter weight, agile execution backend for future versions of Shark. In this post, we’d like to highlight some of the ways in which tight integration into Scala and Spark provide us powerful tools to optimize query execution with Spark SQL. This post outlines one of the most exciting features, dynamic code generation, and explains what type of performance boost this feature can offer using queries from a... | michael | {"createdOn":"2014-06-02","publishedOn":"2014-06-02","tz":"UTC"} | null | 528 | https://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html | exciting-performance-improvements-on-the-horizon-for-spark-sql | publish | Exciting Performance Improvements on the Horizon for Spark SQL | 2014-06-02T00:00:00.000+0000 |
| ["Michael Hiskey (VP at MicroStrategy Inc.)"] | ["Company Blog","Partners"] | <div class="post-meta">This post is guest authored by our friends at <a href="http://www.microstrategy.com" target="_blank">MicroStrategy</a> describing why they're excited to have their platform "Certified on Apache Spark".</div> <hr /> <h2>The Need for Speed</h2> Over the past few years, we have seen Hadoop emerge as an effective foundation for many organizations’ big data management frameworks, but as the volume and varieties of data increase, speed continues to be a challenge. More and more of our customers are embracing Big Data, and the value of their investment is dependent on (and limited by) how quickly they can take data to action. We’ve been listening to our clients to understand how we can innovate to stay ahead of the curve to help solve these challenges. Apache Spark grabbed our attention because it addresses many of the limitations of Hadoop’s traditional functionality. Plus, Spark is simply impossible to ignore. The active, growing community of developers and enterpri... | arsalan | {"createdOn":"2014-06-04","publishedOn":"2014-06-04","tz":"UTC"} | null | 569 | https://databricks.com/blog/2014/06/04/microstrategy-certified-on-spark.html | microstrategy-certified-on-spark | publish | MicroStrategy "Certified on Apache Spark" | 2014-06-04T00:00:00.000+0000 |
| ["Christopher Nguyen (CEO & Co-Founder of Adatao)"] | ["Company Blog","Partners"] | <div class="post-meta">This post is guest authored by our friends at <a href="http://www.arimo.com" target="_blank">Arimo</a> describing why and how they bet on Apache Spark.</div> <hr /> In early 2012, a group of engineers with background in distributed systems and machine learning came together to form Arimo. We saw a major unsolved problem in the nascent Hadoop ecosystem: it was largely a storage play. Data was sitting passively on HDFS, with very little value being extracted. To be sure, there was MapReduce, Hive, Pig, etc., but value is a strong function of (a) speed of computation, (b) sophistication of logic, and (c) ease of use. While Hadoop ecosystem was being developed well at the substrate, there was enormous opportunities above it left uncaptured. <strong>On speed:</strong> we had seen data move at-scale and at enormously faster rates in systems like Dremel and PowerDrill at Google. It enabled interactive behavior simply not available to Hadoop users. Without doubt, we k... | arsalan | {"createdOn":"2014-06-11","publishedOn":"2014-06-11","tz":"UTC"} | null | 585 | https://databricks.com/blog/2014/06/11/application-spotlight-arimo.html | application-spotlight-arimo | publish | Application Spotlight: Arimo | 2014-06-11T00:00:00.000+0000 |
| ["Databricks Press Office"] | ["Company Blog","Events"] | <ul> <li>Three-Day Event in San Francisco Invites Attendees to Gain Insights from the Leading Organizations in Big Data</li> <li>Keynote Speakers Include Executives from Databricks, Cloudera, MapR, DataStax, Jawbone and More</li> <li>Spark Summit Features Different Tracks for Applications, Development, Data Science and Research</li> </ul> BERKELEY, Calif.--(BUSINESS WIRE)-- Databricks and the sponsors of Spark Summit 2014 today announced the full agenda for the summit, including a host of exciting keynotes and community talks. The event will be held June 30–July 2, 2014, at The Westin St. Francis in San Francisco. Spark Summit 2014 arrives at an exciting time for the Apache Spark platform, which has become the most active open source project in the Hadoop ecosystem with more than 200 contributors in the past year. Now available in all major Hadoop distributions, Spark has fostered a fast-growing community on the strength of its technical capabilities, which make big data... | scott | {"createdOn":"2014-06-12","publishedOn":"2014-06-12","tz":"UTC"} | null | 609 | https://databricks.com/blog/2014/06/11/spark-summit-2014-brings-together-apache-spark-community.html | spark-summit-2014-brings-together-apache-spark-community | publish | Spark Summit 2014 Brings Together Apache Spark Community | 2014-06-12T00:00:00.000+0000 |
| ["Dean Wampler (Typesafe)"] | ["Company Blog","Partners"] | <div class="post-meta">This post is guest authored by our friends at <a href="http://www.lightbend.com" target="_blank">Lightbend</a> after having their Lightbend Activator Apache Spark templates be "Certified on Apache Spark".</div> <hr /> <h2>Apache Spark and the Lightbend Reactive Platform: A Match Made in Heaven</h2> When I started working with Hadoop several years ago, it was frustrating to find that writing Hadoop jobs was hard to do. If your problem fits a query model, then <a title="Hive" href="http://hive.apache.org" target="_blank">Hive</a> provides a SQL-based scripting tool. For many common dataflow problems, <a href="http://pig.apache.org" target="_blank">Pig</a> provides useful abstractions, but it isn't a full-fledged, "Turing-complete" language. Otherwise, you had to use the low-level <a href="http://wiki.apache.org/hadoop/MapReduce" target="_blank">Hadoop MapReduce</a> API. Some third-party APIs exist that wrap the MapReduce API, such as <a href="http://cascading.org... | arsalan | {"createdOn":"2014-06-13","publishedOn":"2014-06-13","tz":"UTC"} | null | 628 | https://databricks.com/blog/2014/06/13/application-spotlight-lightbend.html | application-spotlight-lightbend | publish | Application Spotlight: Lightbend | 2014-06-13T00:00:00.000+0000 |
| ["Hari Kodakalla (EVP at Apervi Inc.)"] | ["Company Blog","Partners"] | <div class="post-meta">This post is guest authored by our friends at <a href="http://www.apervi.com" target="_blank">Apervi</a> after having their Conflux Director™ application be "Certified on Apache Spark".</div> <hr /> <h2>Big Data on Steroids with Apache Spark</h2> As big data takes center stage in the new data explosion, Hadoop has emerged as one the leading technologies addressing the challenges in the space. As the data processing needs of enterprises are growing newer technologies like Apache Spark have emerged as significant options that consistently offer expanded capabilities for the big data space. As these enterprise needs are met, so is the increased appetite for faster processing, low latency requirements for high velocity data and an iterative demand for processing where leading technologies like Hadoop fall short of expectations or at times seem cumbersome to implement due to its inherent design. Delivering on this growing need of enterprises is where Spark plays a ... | arsalan | {"createdOn":"2014-06-23","publishedOn":"2014-06-23","tz":"UTC"} | null | 643 | https://databricks.com/blog/2014/06/23/application-spotlight-apervi.html | application-spotlight-apervi | publish | Application Spotlight: Apervi | 2014-06-23T00:00:00.000+0000 |
| ["Bill Kehoe (Big Data Architect at Qlik)"] | ["Company Blog","Partners"] | <div class="post-meta">This post is guest authored by our friends at <a href="http://www.qlik.com" target="_blank">Qlik</a> describing how Apache Spark enables the full power of QlikView, recently Certified on Apache Spark, and its Associative Experience feature over the entire HDFS data set.</div> <hr /> <h2>The Power of Qlik</h2> Qlik provides software and services that help make understanding data a natural part of how people make decisions. Our product, QlikView, is the leading Business Discovery platform that incorporates a unique, associative experience that empowers business users to follow their own path to formulate and answer questions that lead to better decisions. Traditional, query-based BI tools force users thru pre-defined navigation paths which limit the kinds of questions that can be answered and require costly and time consuming revisions to address evolving business needs. In contrast, when a user selects data items using QlikView, all the fields and charts are imm... | arsalan | {"createdOn":"2014-06-24","publishedOn":"2014-06-24","tz":"UTC"} | null | 651 | https://databricks.com/blog/2014/06/24/application-spotlight-qlik.html | application-spotlight-qlik | publish | Application Spotlight: Qlik | 2014-06-24T00:00:00.000+0000 |
| ["Databricks Press Office"] | ["Announcements","Company Blog"] | <em>Certified distributions maintain compatibility with open source Apache Spark distribution and thus support the growing ecosystem of Apache Spark applications</em> <hr /> <strong>BERKELEY, Calif. -- June 26, 2014 --</strong> Databricks, the company founded by the creators of Apache Spark, the next generation Big Data engine, today announced the <a href="https://databricks.com/spark/certification/certified-spark-distribution" target="_blank">“Certified Spark Distribution” </a>program for vendors with a commercial Spark distribution. Certification indicates that the vendor’s Spark distribution is compatible with the open source Apache Spark distribution, enabling “Certified on Spark” applications - certified to work with Apache Spark - to run on the vendor’s Spark distribution out-of-the-box. “One of Databricks’ goals is to ensure users have a fantastic experience. Our belief is that having the community work together to maintain compatibility and therefore facilitate a vibrant app... | arsalan | {"createdOn":"2014-06-26","publishedOn":"2014-06-26","tz":"UTC"} | null | 703 | https://databricks.com/blog/2014/06/26/databricks-launches-certified-spark-distribution-program.html | databricks-launches-certified-spark-distribution-program | publish | Databricks Launches "Certified Apache Spark Distribution" Program | 2014-06-26T00:00:00.000+0000 |
| ["Costin Leau (Engineer at Elasticsearch)"] | ["Company Blog","Partners"] | <div class="post-meta">This post is guest authored by our friends at <a href="http://www.elasticsearch.com" target="_blank">Elasticsearch</a> announcing Elasticsearch is now "Certified on Apache Spark", the first step in a collaboration to provide tighter integration between Elasticsearch and Spark.</div> <hr /> <h2>Elasticsearch Now “Certified on Spark”</h2> Helping businesses get insights out of their data, fast, is core to the mission of Elasticsearch. Being able to live wherever a business stores their data is obviously critical to that mission, and Hadoop is one of the leaders in providing a way for businesses to store massive amounts of data at scale. Over the course of the past year, we have been working hard to bring the power of our real-time search and analytics engine to the Hadoop ecosystem. Our Hadoop connector, Elasticsearch for Apache Hadoop, is compatible with the top three Hadoop distributions – Cloudera, Hortonworks and MapR – and today has achieved another exciting... | arsalan | {"createdOn":"2014-06-28","publishedOn":"2014-06-28","tz":"UTC"} | null | 713 | https://databricks.com/blog/2014/06/27/application-spotlight-elasticsearch.html | application-spotlight-elasticsearch | publish | Application Spotlight: Elasticsearch | 2014-06-28T00:00:00.000+0000 |
| ["Jake Cornelius (SVP of Product Management at Pentaho)"] | ["Company Blog","Partners"] | [sidenote]This post is guest authored by our friends at <a href="http://www.pentaho.com" target="_blank">Pentaho</a> after having their data integration and analytics platform <a href="http://www.databricks.com/certification" target="_blank">“Certified on Apache Spark.”</a>[/sidenote] <hr /> One of Pentaho’s great passions is to empower organizations to take advantage of amazing innovations in <a href="http://www.pentaho.com/what-is-big-data" target="_blank">Big Data</a> to solve new challenges using the existing skill sets they have in their organizations today. Our Pentaho Labs prototyping and innovation efforts around natively integrating data engineering and analytics with Big Data platforms like <a href="http://www.pentaho.com/what-is-hadoop" target="_blank">Hadoop</a> and <a href="http://www.pentaho.com/storm" target="_blank">Storm</a> have already led dozens of customers to deploy next-generation Big Data solutions. Examples of these solutions include <a href="http://www.pent... | arsalan | {"createdOn":"2014-06-30","publishedOn":"2014-06-30","tz":"UTC"} | null | 720 | https://databricks.com/blog/2014/06/30/application-spotlight-pentaho.html | application-spotlight-pentaho | publish | Application Spotlight: Pentaho | 2014-06-30T00:00:00.000+0000 |
| ["SriSatish Ambati (CEO of 0xData)"] | ["Company Blog","Partners"] | <div class="post-meta">This post is guest authored by our friends at <a href="http://www.0xdata.com" target="_blank">0xData</a> discussing the release of Sparkling Water - the integration of their H20 offering with the Apache Spark platform.</div> <hr /> <h3>H20 – The Killer-App on Apache Spark</h3> <img class="aligncenter size-full wp-image-62" src="https://databricks.com/wp-content/uploads/2014/06/Spark-+-H20.png" width="472" /> In-memory big data has come of age. The Apache Spark platform, with its elegant API, provides a unified platform for building data pipelines. H2O has focused on scalable machine learning as the API for big data applications. Spark + H2O combines the capabilities of H2O with the Spark platform – converging the aspirations of data science and developer communities. H2O is the Killer-Application for Spark. <img class="aligncenter size-full wp-image-62" src="https://databricks.com/wp-content/uploads/2014/06/H20-the-Killer-App.png" width="472" /> <h3>Backdrop<... | arsalan | {"createdOn":"2014-06-30","publishedOn":"2014-06-30","tz":"UTC"} | null | 732 | https://databricks.com/blog/2014/06/30/sparkling-water-h20-spark.html | sparkling-water-h20-spark | publish | Sparkling Water = H20 + Apache Spark | 2014-06-30T00:00:00.000+0000 |
| ["Databricks Press Office"] | ["Announcements","Company Blog"] | <ul> <li>Databricks Cloud Allows Users to Get Value from Apache Spark without the Challenges Normally Associated with Big Data Infrastructure</li> <li>Ease-of-Use of Turnkey Solution Brings the Power of Spark to a Wider Audience and Fuels the Growth of the Spark Ecosystem</li> <li>Funding Led by NEA with Follow-on Investment from Andreessen Horowitz</li> </ul> <strong>Berkeley, Calif. (June 30, 2014)</strong>—Databricks, the company founded by the creators of Apache Spark—the powerful open-source processing engine that provides blazingly fast and sophisticated analytics—announced today the launch of <a title="Databricks Cloud" href="https://databricks.com/cloud">Databricks Cloud</a>, a cloud platform built around Apache Spark. In addition to this launch, the company is announcing the close of $33 million in series B funding led by New Enterprise Associates (NEA) with follow-on investment from Andreessen Horowitz. “Getting the full value out of their Big Data investments is still... | arsalan | {"createdOn":"2014-06-30","publishedOn":"2014-06-30","tz":"UTC"} | null | 768 | https://databricks.com/blog/2014/06/30/databricks-unveils-spark-based-cloud-platform.html | databricks-unveils-spark-based-cloud-platform | publish | Databricks Unveils Apache Spark-Based Cloud Platform; Announces Series B Funding | 2014-06-30T00:00:00.000+0000 |
| ["Arsalan Tavakoli-Shiraji"] | ["Company Blog","Events"] | At Databricks, we’ve been thrilled to see the rapid pace of adoption of Apache Spark, as it has been embraced by an increasing number of enterprise vendors and has grown to be the most active open source project in the Hadoop ecosystem. We also know that a critical piece of enabling enterprises to unlock its potential is a strong ecosystem of applications built on top of or integrated with Spark. We launched the <a href="http://www.databricks.com/certification/">“Certified on Apache Spark”</a> program to support these application developer efforts, and have been blown away at the diverse set of applications being built on top of Spark, and want this great work to be exposed to the broader community. In that light, this year’s Spark Summit will have an “Application Spotlight” segment that will highlight some of the best we’ve seen. Read on for details on how to apply and what selection entails. All applications eligible (even if not yet certified) for the Databricks “Certified on Spar... | arsalan | {"createdOn":"2014-04-29","publishedOn":"2014-04-29","tz":"UTC"} | null | 2462 | https://databricks.com/blog/2014/04/28/databricks-application-spotlight-at-spark-summit-2014.html | databricks-application-spotlight-at-spark-summit-2014 | publish | Databricks Application Spotlight at Spark Summit 2014 | 2014-04-29T00:00:00.000+0000 |
| ["Arsalan Tavakoli-Shiraji"] | ["Company Blog","Partners"] | <p>Today, Datastax and Databricks announced a partnership in which Apache Spark becomes an integral part of the Datastax offering, tightly integrated with Cassandra. We’re very excited to be embarking on this journey with Datastax for a multitude of reasons:</p> <h2 id="integrating-operational-systems-with-analytics">Integrating operational systems with analytics</h2> <p>One of the use cases that we’ve increasingly been asked about by Spark users is the ability to create a closed loop system: perform advanced analytics directly on operational data that is then fed back into the operational system to drive necessary adaptation. The tight integration of Cassandra and Spark will enable users to achieve this goal by leveraging Cassandra as the high-performance transactional database that powers online applications and Spark as a next generation processing engine that can deliver deeper insights, faster while seamlessly moving between the two.</p> <h2 id="spark-beyond-hadoop">Spark beyond... | arsalan | {"createdOn":"2014-05-08","publishedOn":"2014-05-08","tz":"UTC"} | null | 2463 | https://databricks.com/blog/2014/05/08/databricks-and-datastax.html | databricks-and-datastax | publish | Databricks and Datastax | 2014-05-08T00:00:00.000+0000 |
| ["Databricks Press Office"] | ["Announcements","Company Blog"] | <p><strong>VANCOUVER, BC. – April 30, 2014 –</strong> Simba Technologies Inc., the industry’s expert for Big Data connectivity, announced today that Databricks has licensed Simba’s ODBC Driver as its standards-based connectivity solution for Shark, the SQL front-end for Apache Spark, the next generation Big Data processing engine. Founded by the creators of Apache Spark and Shark, Databricks is developing cutting-edge systems to enable enterprises to discover deeper insights, faster.</p> <p>“We believe that Big Data is a tremendous opportunity that is still largely untapped, and we are working to revolutionize what organizations can do with it,” says Ion Stoica, Chief Executive Officer at Databricks, and Professor of Computer Science at UC Berkeley. “As part of this mission, we understand that BI tools will continue to be a key medium for consuming data and analytics and are excited to announce the availability of an enterprise-grade connectivity option for users of BI tools. ... | roy | {"createdOn":"2014-04-30","publishedOn":"2014-04-30","tz":"UTC"} | null | 2464 | https://databricks.com/blog/2014/04/30/databricks-partners-with-simba-to-deliver-shark-odbc-driver.html | databricks-partners-with-simba-to-deliver-shark-odbc-driver | publish | Databricks Partners with Simba to Deliver Shark ODBC Driver | 2014-04-30T00:00:00.000+0000 |
| ["Databricks Press Office"] | ["Announcements","Company Blog","Partners"] | <strong>SAN FRANCISCO — July 1, 2014</strong> — Databricks, the company founded by the creators of Apache Spark – the popular open-source processing engine - today announced a new partnership with <a href="http://www.sap.com" target="_blank">SAP (NYSE: SAP)</a> and to deliver a Databricks-certified Apache Spark distribution offering for the SAP HANA® platform. The full production-ready distribution offering, based on Apache Spark 1.0, is deployable in the cloud or on premise and available for immediate download from SAP at no cost at <a href="http://spr.ly/SAP_and_Spark" target="_blank">spr.ly/SAP_and_Spark</a>. The announcement was made at the Spark Summit 2014, being held June 30 – July 2 in San Francisco. The Databricks-certified distribution offering for SAP HANA contains the Spark processing engine that works with any Hadoop distribution out of the box, providing a more complete data store and processing layer for Hadoop. Certified by Databricks to be compatible with the Apache ... | arsalan | {"createdOn":"2014-07-01","publishedOn":"2014-07-01","tz":"UTC"} | null | 782 | https://databricks.com/blog/2014/07/01/databricks-announces-partnership-with-sap.html | databricks-announces-partnership-with-sap | publish | Databricks Announces Partnership with SAP | 2014-07-01T00:00:00.000+0000 |
| ["Arsalan Tavakoli-Shiraji"] | ["Company Blog","Partners"] | This morning SAP released its own “Certified Spark Distribution” as part of a brand new partnership announced between Databricks and SAP. We’re thrilled to be embarking on this journey with them, not just because of what it means for Databricks as a company, but just as importantly because of what it means for Apache Spark and the Spark community. <h2>Access to the full corpus of data</h2> Fundamentally, every enterprise's big data vision is to convert data into value; a core ingredient in this quest is the availability of the data that needs to be mined for insights. Although the growth in volume of data sitting in HDFS has been incredible and continues to grow exponentially, much of this has been contextual data - e.g., social data, click-stream data, sensor data, logs, 3rd party data sources - and historical data. Real-time operational data - e.g., data from foundational enterprise applications such as ERP (Enterprise Resource Planning), CRM (Customer Relationship Management), and S... | arsalan | {"createdOn":"2014-07-01","publishedOn":"2014-07-01","tz":"UTC"} | null | 785 | https://databricks.com/blog/2014/07/01/integrating-spark-and-hana.html | integrating-spark-and-hana | publish | Integrating Apache Spark and HANA | 2014-07-01T00:00:00.000+0000 |
| ["Reynold Xin"] | ["Apache Spark","Engineering Blog"] | With the introduction of Spark SQL and the new Hive on Apache Spark effort (<a href="https://issues.apache.org/jira/browse/HIVE-7292">HIVE-7292</a>), we get asked a lot about our position in these two projects and how they relate to Shark. At the <a href="http://spark-summit.org/2014">Spark Summit</a> today, we announced that we are ending development of Shark and will focus our resources towards Spark SQL, which will provide a superset of Shark’s features for existing Shark users to move forward. In particular, Spark SQL will provide both a seamless upgrade path from Shark 0.9 server and new features such as integration with general Spark programs. <img class="alignnone wp-image-818 size-large" src="https://databricks.com/wp-content/uploads/2014/07/sql-directions-1024x691.png" alt="Future of SQL on Spark" width="400" /> <h2>Shark</h2> When the Shark project started 3 years ago, Hive (on MapReduce) was the only choice for SQL on Hadoop. Hive compiled SQL into scalable MapReduce jobs a... | rxin | {"createdOn":"2014-07-02","publishedOn":"2014-07-02","tz":"UTC"} | null | 796 | https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html | shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark | publish | Shark, Spark SQL, Hive on Spark, and the future of SQL on Apache Spark | 2014-07-02T00:00:00.000+0000 |
| ["Ion Stoica"] | ["Company Blog","Product"] | Our vision at Databricks is to <strong>make big data easy</strong> so that we enable <strong>every</strong> organization to turn its data into value. At Spark Summit 2014, we were very excited to unveil <a href="https://databricks.com/cloud" target="_blank">Databricks</a>, our first product towards fulfilling this vision. In this post, I’ll briefly go over the challenges that data scientists and data engineers face today when working with big data, and then show how Databricks addresses these challenges. <h2>Today’s Big Data Challenges</h2> While the promise of big data to <a href="http://spark-summit.org/2014/talk/using-spark-to-generate-analytics-for-international-cable-tv-video-distribution" target="_blank">improve businesses</a>, <a href="http://spark-summit.org/2014/talk/david-patterson" target="_blank">save lives</a>, and <a href="http://spark-summit.org/2014/talk/A-platform-for-large-scale-neuroscience" target="_blank">advance science</a> is becoming more and more real, analyzi... | ion | {"createdOn":"2014-07-14","publishedOn":"2014-07-14","tz":"UTC"} | null | 865 | https://databricks.com/blog/2014/07/14/databricks-cloud-making-big-data-easy.html | databricks-cloud-making-big-data-easy | publish | Databricks: Making Big Data Easy | 2014-07-14T00:00:00.000+0000 |
| ["Xiangrui Meng"] | ["Apache Spark","Engineering Blog","Machine Learning"] | MLlib is an Apache Spark component focusing on machine learning. It became a standard component of Spark in version 0.8 (Sep 2013). The initial contribution was from Berkeley AMPLab. Since then, 50+ developers from the open source community have contributed to its codebase. With the release of Apache Spark 1.0, I’m glad to share some of the new features in MLlib. Among the most important ones are: <ul> <li>sparse data support</li> <li>regression and classification trees</li> <li>distributed matrices</li> <li>PCA and SVD</li> <li>L-BFGS optimization algorithm</li> <li>new user guide and code examples</li> </ul> This is the first in a series of blog posts about features and optimizations in MLlib. We will focus on one feature new in 1.0 — sparse data support. <h2>Large-scale ≈ Sparse</h2> When I was in graduate school, I wrote “large-scale sparse least squares” in a paper draft. My advisor crossed out the word “sparse” and left a comment: “Large-scale already implies sparsity... | Xiangrui | {"createdOn":"2014-07-16","publishedOn":"2014-07-16","tz":"UTC"} | null | 909 | https://databricks.com/blog/2014/07/16/new-features-in-mllib-in-spark-1-0.html | new-features-in-mllib-in-spark-1-0 | publish | New Features in MLlib in Apache Spark 1.0 | 2014-07-16T00:00:00.000+0000 |
| ["Matei Zaharia"] | ["Apache Spark","Engineering Blog"] | <div class="post-meta">This post originally appeared in <a href="http://inside-bigdata.com/2014/07/15/theres-spark-theres-fire-state-apache-spark-2014/" target="_blank">insideBIGDATA</a> and is reposted here with permission.</div> <hr /> With the second <a href="http://spark-summit.org/2014">Spark Summit</a> behind us, we wanted to take a look back at our journey since 2009 when Apache Spark, the fast and general engine for large-scale data processing, was initially developed. It has been exciting and extremely gratifying to watch Spark mature over the years, thanks in large part to the vibrant, open source community that latched onto it and busily began contributing to make Spark what it is today. The idea for Spark first emerged in the AMPLab (AMP stands for Algorithms, Machines, and People) at the University of California, Berkeley. With its significant industry funding and exposure, the AMPlab had a unique perspective on what is important and what issues exist among early adopte... | matei | {"createdOn":"2014-07-19","publishedOn":"2014-07-19","tz":"UTC"} | null | 965 | https://databricks.com/blog/2014/07/18/the-state-of-apache-spark-in-2014.html | the-state-of-apache-spark-in-2014 | publish | The State of Apache Spark in 2014 | 2014-07-19T00:00:00.000+0000 |
| ["Burak Yavuz","Xiangrui Meng","Reynold Xin"] | ["Apache Spark","Engineering Blog","Machine Learning"] | Recommendation systems are among the most popular applications of machine learning. The idea is to predict whether a customer would like a certain item: a product, a movie, or a song. Scale is a key concern for recommendation systems, since computational complexity increases with the size of a company's customer base. In this blog post, we discuss how Apache Spark MLlib enables building recommendation models from billions of records in just a few lines of Python (<a href="http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html">Scala/Java APIs also available</a>).<!--more--> [python] from pyspark.mllib.recommendation import ALS # load training and test data into (user, product, rating) tuples def parseRating(line): fields = line.split() return (int(fields[0]), int(fields[1]), float(fields[2])) training = sc.textFile("...").map(parseRating).cache() test = sc.textFile("...").map(parseRating) # train a recommendation model model = ALS.train(tra... | Xiangrui | {"createdOn":"2014-07-23","publishedOn":"2014-07-23","tz":"UTC"} | null | 980 | https://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html | scalable-collaborative-filtering-with-spark-mllib | publish | Scalable Collaborative Filtering with Apache Spark MLlib | 2014-07-23T00:00:00.000+0000 |
| ["Li Pu","Reza Zadeh"] | ["Apache Spark","Engineering Blog","Machine Learning"] | <div class="post-meta">Guest post by Li Pu from Twitter and Reza Zadeh from Databricks on their recent contribution to Apache Spark's machine learning library.</div> <hr /> The <a href="http://en.wikipedia.org/wiki/Singular_value_decomposition">Singular Value Decomposition (SVD)</a> is one of the cornerstones of linear algebra and has widespread application in many real-world modeling situations. Problems such as recommender systems, linear systems, least squares, and many others can be solved using the SVD. It is frequently used in statistics where it is related to principal component analysis (PCA) and to correspondence analysis, and in signal processing and pattern recognition. Another usage is latent semantic indexing in natural language processing. Decades ago, before the rise of distributed computing, computer scientists developed the single-core <a href="http://www.caam.rice.edu/software/ARPACK/">ARPACK package</a> for computing the eigenvalue decomposition of a matrix. Since... | matei | {"createdOn":"2014-07-22","publishedOn":"2014-07-22","tz":"UTC"} | null | 1049 | https://databricks.com/blog/2014/07/21/distributing-the-singular-value-decomposition-with-spark.html | distributing-the-singular-value-decomposition-with-spark | publish | Distributing the Singular Value Decomposition with Apache Spark | 2014-07-22T00:00:00.000+0000 |
| ["Scott Walent"] | ["Company Blog","Events"] | From June 30 to July 2, 2014 we held the <a href="http://spark-summit.org/2014">second Spark Summit</a>, a conference focused on promoting the adoption and growth of <a href="http://spark.apache.org">Apache Spark</a>. This was an exciting year for the Spark community and we are proud to share some highlights. <ul> <li>1,164 participants from over 453 companies attended</li> <li>Spark Training sold out at 300 participants</li> <li>31 organizations sponsored the event</li> <li>12 keynotes and 52 community presentations were given</li> </ul> Videos and slides from all presentations are now available on the <a href="http://spark-summit.org/2014/agenda">Summit 2014 agenda</a> page. Some highlights include: <ul> <li>Spark Summit <a href="https://www.youtube.com/watch?v=lO7LhVZrNwA&index=2&list=PL-x35fyliRwiST9gF7Z8Nu3LgJDFRuwfr">keynote from Databricks CEO Ion Stoica</a> introducing <a href="http://www.databricks.com/cloud">Databricks Cloud</a></li> <li>Open source comm... | scott | {"createdOn":"2014-07-23","publishedOn":"2014-07-23","tz":"UTC"} | null | 1081 | https://databricks.com/blog/2014/07/22/spark-summit-2014-highlights.html | spark-summit-2014-highlights | publish | Spark Summit 2014 Highlights | 2014-07-23T00:00:00.000+0000 |
| ["Oscar Mendez (CEO of Stratio)"] | ["Company Blog","Partners"] | <div class="post-meta">This is a guest post from our friends at <a href="http://www.stratio.com" target="_blank">Stratio</a> announcing that their platform is now a "Certified Apache Spark Distribution".</div> <hr /> <h2>Certified distribution</h2> Stratio is delighted to announce that it is officially a Certified Apache Spark Distribution. The certification is very important for us because we deeply believe that the certification program provides many benefits to the Spark community: It facilitates collaboration and integration, offers broad evolution and support for the rich Spark ecosystem, simplifies adoption of critical security updates and allows development of applications valid for any certified distribution - a key ingredient for a successful ecosystem. <!--more--> This post is a brief history of how we started with big data technologies until we made the shift to Spark. <h2>When Stratio met Spark: A true love story</h2> We started using Big Data technologies more than 7 yea... | arsalan | {"createdOn":"2014-08-08","publishedOn":"2014-08-08","tz":"UTC"} | null | 1144 | https://databricks.com/blog/2014/08/08/when-stratio-met-spark-a-true-love-story.html | when-stratio-met-spark-a-true-love-story | publish | When Stratio Met Apache Spark: A True Love Story | 2014-08-08T00:00:00.000+0000 |
| ["Andy Huang (Alibaba Taobao Data Mining Team)","Wei Wu (Alibaba Taobao Data Mining Team)"] | ["Apache Spark","Engineering Blog","Machine Learning"] | <div class="post-meta">This is a guest blog post from our friends at Alibaba Taobao.</div> <hr /> Alibaba Taobao operates one of the world’s largest e-commerce platforms. We collect hundreds of petabytes of data on this platform and use Apache Spark to analyze these enormous amounts of data. Alibaba Taobao probably runs some of the largest Spark jobs in the world. For example, some Spark jobs run for weeks to perform feature extraction on petabytes of image data. In this blog post, we share our experience with Spark and GraphX from prototype to production at the Alibaba Taobao Data Mining Team. <!--more--> Every day, hundreds of millions of users and merchants interact on Alibaba Taobao’s marketplace. These interactions can be expressed as complicated, large scale graphs. Mining data requires a distributed data processing engine that can support fast interactive queries as well as sophisticated algorithms. Spark and GraphX embed a standard set of graph mining algorithms, including ... | rxin | {"createdOn":"2014-08-15","publishedOn":"2014-08-15","tz":"UTC"} | null | 1170 | https://databricks.com/blog/2014/08/14/mining-graph-data-with-spark-at-alibaba-taobao.html | mining-graph-data-with-spark-at-alibaba-taobao | publish | Mining Ecommerce Graph Data with Apache Spark at Alibaba Taobao | 2014-08-15T00:00:00.000+0000 |
| ["Doris Xin","Burak Yavuz","Xiangrui Meng","Hossein Falaki"] | ["Apache Spark","Engineering Blog","Machine Learning"] | One of our philosophies in Apache Spark is to provide rich and friendly built-in libraries so that users can easily assemble data pipelines. With Spark, and MLlib in particular, quickly gaining traction among data scientists and machine learning practitioners, we’re observing a growing demand for data analysis support outside of model fitting. To address this need, we have started to add scalable implementations of common statistical functions to facilitate various components of a data pipeline. <!--more-->We’re pleased to announce Apache Spark 1.1. ships with built-in support for several statistical algorithms common in exploratory data pipelines: <ul> <li><strong>correlations</strong>: data dependence analysis</li> <li><strong>hypothesis testing</strong>: goodness of fit; independence test</li> <li><strong>stratified sampling</strong>: scaling training set with controlled label distribution</li> <li><strong>random data generation</strong>: randomized algorithms; performance t... | Xiangrui | {"createdOn":"2014-08-27","publishedOn":"2014-08-27","tz":"UTC"} | null | 1301 | https://databricks.com/blog/2014/08/27/statistics-functionality-in-spark.html | statistics-functionality-in-spark | publish | Statistics Functionality in Apache Spark 1.1 | 2014-08-27T00:00:00.000+0000 |
| ["Patrick Wendell"] | ["Apache Spark","Engineering Blog","Streaming"] | Today we’re thrilled to announce the release of Apache Spark 1.1! Apache Spark 1.1 introduces many new features along with scale and stability improvements. This post will introduce some key features of Apache Spark 1.1 and provide context on the priorities of Spark for this and the next release.<!--more--> In the next two weeks, we’ll be publishing blog posts with more details on feature additions in each of the major components. Apache Spark 1.1 is already available to Databricks customers and has also been posted today on the <a href="http://spark.apache.org/releases/spark-release-1-1-0.html">Apache Spark website</a>. <!--more--> <h2>Maturity of SparkSQL</h2> The 1.1 released upgrades Spark SQL significantly from the preview delivered in Apache Spark 1.0. At Databricks, we’ve migrated all of our customer workloads from Shark to Spark SQL, with between 2X and 5X <a href="https://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html">perfo... | patrick | {"createdOn":"2014-09-12","publishedOn":"2014-09-12","tz":"UTC"} | null | 1360 | https://databricks.com/blog/2014/09/11/announcing-spark-1-1.html | announcing-spark-1-1 | publish | Announcing Apache Spark 1.1 | 2014-09-12T00:00:00.000+0000 |
| ["Arsalan Tavakoli-Shiraji","Tathagata Das","Patrick Wendell"] | ["Apache Spark","Engineering Blog","Streaming"] | With Apache Spark 1.1 recently released, we’d like to take this occasion to feature one of the most popular Spark components - Spark Streaming - and highlight who is using Spark Streaming and why. Apache Spark 1.1. adds several new features to Spark Streaming. In particular, Spark Streaming extends its library of ingestion sources to include Amazon Kinesis, a hosted stream processing engine, as well as to provide high availability for Apache Flume sources. Moreover, Apache Spark 1.1 adds the first of a set of online machine learning algorithms with the introduction of a streaming linear regression. Many organizations have evolved from exploratory, discovery use cases of big data to use cases that require reasoning on data as it arrives in order to make decisions in real time. Spark Streaming enables this category of high-value use cases, providing a system for processing fast and large streams of data in real time. <b>What is it?</b> Spark Streaming is an extension of the core S... | arsalan | {"createdOn":"2014-09-16","publishedOn":"2014-09-16","tz":"UTC"} | null | 1386 | https://databricks.com/blog/2014/09/16/spark-1-1-the-state-of-spark-streaming.html | spark-1-1-the-state-of-spark-streaming | publish | Apache Spark 1.1: The State of Spark Streaming | 2014-09-16T00:00:00.000+0000 |
| ["Burak Yavuz","Xiangrui Meng"] | ["Apache Spark","Engineering Blog","Machine Learning"] | With an ever-growing community, Apache Spark has had it’s <a href="https://databricks.com/blog/2014/09/11/announcing-spark-1-1.html" target="_blank">1.1 release</a>. MLlib has had its fair share of contributions and now supports many new features. We are excited to share some of the performance improvements observed in MLlib since the 1.0 release, and discuss two key contributing factors: torrent broadcast and tree aggregation. <h2>Torrent broadcast</h2> The beauty of Spark as a unified framework is that any improvements made on the core engine come for free in its standard components like MLlib, Spark SQL, Streaming, and GraphX. In Apache Spark 1.1, we changed the default broadcast implementation of Spark from the traditional <code>HttpBroadcast</code> to <code>TorrentBroadcast</code>, a BitTorrent like protocol that evens out the load among the driver and the executors. When an object is broadcasted, the driver divides the serialized object into multiple chunks, and broadcasts the ch... | Xiangrui | {"createdOn":"2014-09-22","publishedOn":"2014-09-22","tz":"UTC"} | null | 1393 | https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html | spark-1-1-mllib-performance-improvements | publish | Apache Spark 1.1: MLlib Performance Improvements | 2014-09-22T00:00:00.000+0000 |
| ["Gavin Targonski (Product Management at Talend)"] | ["Company Blog","Partners"] | <div class="post-meta">This post is guest authored by our friends at <a href="http://www.talend.com" target="_blank">Talend</a> after having Talend Studio <a href="http://www.databricks.com/certification" target="_blank">“Certified on Apache Spark.”</a></div> <hr /> As the move to the next generation of integration platforms grows momentum, the need to implement a proven and scalable technology is critical. Databricks and Apache Spark, delivered on the major Hadoop distributions, is one such area where the delivery of massively scalable technology with low risk implementation is really key. At Talend we see a wide array of batch processes, moving to an operational and real time perspective, driven by the consumers of the data. In this vein, the uptake in adoption and the growing community of Apache Spark, the powerful open-source processing engine, has been hard to miss. In a relatively short time, it is now a part of every major Hadoop vendor’s offering, is the most active open sou... | arsalan | {"createdOn":"2014-09-15","publishedOn":"2014-09-15","tz":"UTC"} | null | 1411 | https://databricks.com/blog/2014/09/15/application-spotlight-talend.html | application-spotlight-talend | publish | Application Spotlight: Talend | 2014-09-15T00:00:00.000+0000 |
| ["Nick Pentreath (Graphflow)","Kan Zhang (IBM)"] | ["Apache Spark","Engineering Blog"] | <div class="post-meta">This is a guest post by Nick Pentreath of <a href="http://graphflow.com">Graphflow</a> and Kan Zhang of <a href="http://ibm.com">IBM</a>, who contributed Python input/output format support to Apache Spark 1.1.</div> <hr /> Two powerful features of Apache Spark include its native APIs provided in Scala, Java and Python, and its compatibility with any Hadoop-based input or output source. This language support means that users can quickly become proficient in the use of Spark even without experience in Scala, and furthermore can leverage the extensive set of third-party libraries available (for example, the many data analysis libraries for Python). Built-in Hadoop support means that Spark can work "out of the box" with any data storage system or format that implements Hadoop's <code>InputFormat</code> and <code>OutputFormat</code> interfaces, including HDFS, HBase, Cassandra, Elasticsearch, DynamoDB and many others, as well as various data serialization formats s... | matei | {"createdOn":"2014-09-18","publishedOn":"2014-09-18","tz":"UTC"} | null | 1431 | https://databricks.com/blog/2014/09/17/spark-1-1-bringing-hadoop-inputoutput-formats-to-pyspark.html | spark-1-1-bringing-hadoop-inputoutput-formats-to-pyspark | publish | Apache Spark 1.1: Bringing Hadoop Input/Output Formats to PySpark | 2014-09-18T00:00:00.000+0000 |
| ["Vida Ha"] | ["Company Blog","Product"] | At Databricks, we are often asked how to go beyond the basic Apache Spark tutorials and start building real applications with Spark. As a result, we are developing reference applications <a href="http://github.com/databricks/reference-apps" target="_blank">on github</a> to demonstrate that. We believe this is a great way to learn Spark, and we plan on incorporating more features of Spark into the applications over time. We also hope to highlight any technologies that are compatible with Spark and include best practices. <h3>Log Analyzer Application</h3> Our first reference application is log analysis with Spark. Logs are a large and common data set that contain a rich set of information. Log data can be used for monitoring web servers, improving business and customer intelligence, building recommendation systems, preventing fraud, and much more. Spark is a wonderful tool to use on logs - Spark can process logs faster than Hadoop MapReduce, it is easy to code so we can compute many... | vida | {"createdOn":"2014-09-24","publishedOn":"2014-09-24","tz":"UTC"} | null | 1460 | https://databricks.com/blog/2014/09/23/databricks-reference-applications.html | databricks-reference-applications | publish | Databricks Reference Applications | 2014-09-24T00:00:00.000+0000 |
| ["John Tripier","Paco Nathan"] | ["Announcements","Company Blog"] | When Databricks was initially founded a little more than a year ago, there was tremendous excitement around Apache Spark, but it was still early days. The project had ~60 contributors over the previous 12 months, and was not yet available commercially. One of our main focus areas since then has been continuing to grow Spark and the community and making it easily accessible for enterprises and users alike. Taking a step back, it’s terrific to see the progress that Spark has made since then. Spark is today the most active open source project in the Big Data ecosystem with over 300 contributors in the last 12 months alone, and is available through several platform vendors, including all of the major Hadoop distributors. The <a href="http://www.spark-summit.org" target="_blank">Spark Summit</a>, dedicated to bringing together the Spark community, more than doubled in size a short 6 months after the inaugural version, and Spark meetups continue to grow in size, frequency, and cities sp... | john | {"createdOn":"2014-09-19","publishedOn":"2014-09-19","tz":"UTC"} | null | 1504 | https://databricks.com/blog/2014/09/18/databricks-and-oreilly-media-launch-certification-program-for-apache-spark-developers.html | databricks-and-oreilly-media-launch-certification-program-for-apache-spark-developers | publish | Databricks and O'Reilly Media launch Certification Program for Apache Spark Developers | 2014-09-19T00:00:00.000+0000 |
| ["Christopher Burdorf (Senior Software Engineer at NBC Universal)"] | ["Company Blog","Customers"] | <div class="post-meta">This is a guest blog post from our friends at NBC Universal outlining their Apache Spark use case.</div> <hr /> <h2>Business Challenge</h2> NBC Universal is one of the world’s largest media and entertainment companies with revenues of US$ 26 billion. It operates television networks, cable channels, motion picture and television production companies as well as branded theme parks worldwide. Popular brands include NBC, Universal Pictures, Universal Parks & Resorts, Telemundo, E!, Bravo and MSNBC. Digital video media clips for NBC Universal’s cable TV programs and commercials are produced and broadcast from its Los Angeles office to cable TV channels in Asia Pacific, Europe, Latin America and the United States. Moreover, viewers increasingly consume NBC Universal’s vast content library online and on-demand. Therefore, NBC Universal’s IT Infrastructure team needs to make decisions on how best to serve that content, which involves a trade-off between storage a... | arsalan | {"createdOn":"2014-09-24","publishedOn":"2014-09-24","tz":"UTC"} | null | 1619 | https://databricks.com/blog/2014/09/24/apache-spark-improves-the-economics-of-video-distribution-at-nbc-universal.html | apache-spark-improves-the-economics-of-video-distribution-at-nbc-universal | publish | Apache Spark Improves the Economics of Video Distribution at NBC Universal | 2014-09-24T00:00:00.000+0000 |
| ["Manish Amde (Origami Logic)","Joseph Bradley (Databricks)"] | ["Engineering Blog","Machine Learning"] | <div class="post-meta">This is a post written together with one of our friends at <a href="http://www.origamilogic.com/">Origami Logic</a>. Origami Logic provides a Marketing Intelligence Platform that uses Apache Spark for heavy lifting analytics work on the backend.</div> <hr /> Decision trees and their ensembles are industry workhorses for the machine learning tasks of classification and regression. Decision trees are easy to interpret, handle categorical and continuous features, extend to multi-class classification, do not require feature scaling and are able to capture non-linearities and feature interactions. Due to their popularity, almost every machine learning library provides an implementation of the decision tree algorithm. However, most are designed for single-machine computation and seldom scale elegantly to a distributed setting. Apache Spark is an ideal platform for a scalable distributed decision tree implementation since Spark's in-memory computing allows us to effi... | joseph | {"createdOn":"2014-09-30","publishedOn":"2014-09-30","tz":"UTC"} | null | 1507 | https://databricks.com/blog/2014/09/29/scalable-decision-trees-in-mllib.html | scalable-decision-trees-in-mllib | publish | Scalable Decision Trees in MLlib | 2014-09-30T00:00:00.000+0000 |
| ["Eric Carr (VP Core Systems Group at Guavus)"] | ["Company Blog","Partners"] | <div class="post-meta">This is a guest blog post from our friends at <a href="http://www.guavus.com" target="_blank">Guavus</a> - now a Certified Apache Spark Distribution - outlining how they leverage Spark to deliver value to telecom companies.</div> <hr /> <h2>Business Challenge</h2> Guavus is a leading provider of big data analytics solutions for the Communications Service Provider (CSP) industry. The company counts 4 of the top 5 mobile network operators, 3 of the top 5 Internet backbone providers, as well as 80% of cable MSOs in North America as customers. The Guavus Reflex platform provides operational intelligence to these service providers. Reflex currently analyzes more than 50% of all US mobile data traffic and processes more than 2.5 petabytes of data per day. Yet that data grows at an exponential rate. Ever increasing data volume and velocity makes it harder to generate timely insights. For instance, one operational issue can quickly cascade into multiple issues down-st... | arsalan | {"createdOn":"2014-09-25","publishedOn":"2014-09-25","tz":"UTC"} | null | 1626 | https://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html | guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos | publish | Guavus Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World’s Largest Telcos | 2014-09-25T00:00:00.000+0000 |
| ["Jeremy Freeman (Freeman Lab)"] | ["Apache Spark","Engineering Blog","Streaming"] | The brain is the most complicated organ of the body, and probably one of the most complicated structures in the universe. It’s millions of neurons somehow work together to endow organisms with the extraordinary ability to interact with the world around them. Things our brains control effortlessly -- kicking a ball, or reading and understanding this sentence -- have proven extremely hard to implement in a machine. For a long time, our efforts were limited by experimental technology. Despite the brain having many neurons, most technologies could only monitor the activity of one, or a handful, at once. That these approaches taught us so much -- for example, that there are neurons that respond only when you look at a particular object -- is a testament to experimental ingenuity. In the next era, however, we will be limited not by our recordings, but our ability to make sense of the data. New technologies make it possible to monitor the activity of many thousands of neurons at once -- fro... | arsalan | {"createdOn":"2014-10-01","publishedOn":"2014-10-01","tz":"UTC"} | null | 1648 | https://databricks.com/blog/2014/10/01/spark-as-a-platform-for-large-scale-neuroscience.html | spark-as-a-platform-for-large-scale-neuroscience | publish | Apache Spark as a platform for large-scale neuroscience | 2014-10-01T00:00:00.000+0000 |
| ["Russell Cardullo (Sharethrough)"] | ["Company Blog","Customers"] | <div class="post-meta">This is a guest blog post from our friends at <a href="http://www.sharethrough.com" target="_blank">Sharethrough</a> providing an update on how their use of Apache Spark has continued to expand.</div> <hr /> <h2>Business Challenge</h2> Sharethrough is an advertising technology company that provides native, in-feed advertising software to publishers and advertisers. Native, in-feed ads are designed to match the form and function of the sites they live on, which is particularly important on mobile devices where interruptive advertising is less effective. For publishers, in-feed monetization has become a major revenue stream for their mobile sites and applications. For advertisers, in-feed ads have been proven to drive more brand lift than interruptive banner advertisements. Sharethrough’s publisher and advertiser technology suite is capable of optimizing the format of an advertisement for seamless placement on content publishers websites and apps. This involves ... | arsalan | {"createdOn":"2014-10-07","publishedOn":"2014-10-07","tz":"UTC"} | null | 1668 | https://databricks.com/blog/2014/10/07/sharethrough-uses-spark-streaming-to-optimize-advertisers-return-on-marketing-investment.html | sharethrough-uses-spark-streaming-to-optimize-advertisers-return-on-marketing-investment | publish | Sharethrough Uses Apache Spark Streaming to Optimize Advertisers' Return on Marketing Investment | 2014-10-07T00:00:00.000+0000 |
| ["Sean Kandel (CTO at Trifacta)"] | ["Company Blog","Partners"] | <div class="post-meta">This post is guest authored by our friends at <a href="http://www.trifacta.com" target="_blank">Trifacta</a> after having their data transformation platform <a href="http://www.databricks.com/certification" target="_blank">“Certified on Spark.”</a></div> <hr> Today we announced v2 of the Trifacta Data Transformation Platform, a release that emphasizes the important role that Hadoop plays in the new big data enterprise architecture. With Trifacta v2 we now support transforming data of all shapes and sizes in Hadoop. This means supporting Hadoop-specific data formats as both inputs and outputs in Trifacta v2 - data formats such as Avro, ORC and Parquet. It also means intelligently executing data transformation scripts through not only MapReduce, which was available in Trifacta v1, but also Spark. Trifacta v2 has been officially Certified on Spark by Databricks. Our partnership with Databricks brings the performance and flexibility of the Spark data processing en... | arsalan | {"createdOn":"2014-10-09","publishedOn":"2014-10-09","tz":"UTC"} | null | 1678 | https://databricks.com/blog/2014/10/09/application-spotlight-trifacta.html | application-spotlight-trifacta | publish | Application Spotlight: Trifacta | 2014-10-09T00:00:00.000+0000 |
| ["Reynold Xin"] | ["Apache Spark","Engineering Blog"] | <strong>Update November 5, 2014</strong>: Our benchmark entry has been reviewed by the benchmark committee and Apache Spark has won the <a href="http://sortbenchmark.org/">Daytona GraySort contest</a> for 2014! Please see this <a href="https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html">new blog post for update</a>. Apache Spark has seen phenomenal adoption, being widely slated as the successor to Hadoop MapReduce, and being deployed in clusters from a handful to thousands of nodes. While it was clear to everybody that Spark is more efficient than MapReduce for data that fits in memory, we heard that some organizations were having trouble pushing it to large scale datasets that could not fit in memory. Therefore, since the inception of Databricks, we have devoted much effort, together with the Spark community, to improve the stability, scalability, and performance of Spark. Spark works well for gigabytes or terabytes of data, and it s... | rxin | {"createdOn":"2014-10-10","publishedOn":"2014-10-10","tz":"UTC"} | null | 1685 | https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html | spark-petabyte-sort | publish | Apache Spark the fastest open source engine for sorting a petabyte | 2014-10-10T00:00:00.000+0000 |
| ["Reza Zadeh"] | ["Apache Spark","Engineering Blog","Machine Learning"] | <div class="post-meta">Our friends at Twitter have contributed to MLlib, and this post uses material from Twitter’s description of its <a href="https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum" target="_blank">open-source contribution</a>, with permission. The associated <a href="https://github.com/apache/spark/pull/1778" target="_blank">pull request</a> is slated for release in Apache Spark 1.2.</div> <hr /> <h2>Introduction</h2> We are often interested in finding users, hashtags and ads that are very similar to one another, so they may be recommended and shown to users and advertisers. To do this, we must consider many pairs of items, and evaluate how “similar” they are to one another. We call this the “all-pairs similarity” problem, sometimes known as a “similarity join.” We have developed a new efficient algorithm to solve the similarity join called “Dimension Independent Matrix Square using MapReduce,” or <a href="http://arxiv.org/abs/1304.1467" target="_blank">DIM... | arsalan | {"createdOn":"2014-10-20","publishedOn":"2014-10-20","tz":"UTC"} | null | 1743 | https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html | efficient-similarity-algorithm-now-in-spark-twitter | publish | Efficient similarity algorithm now in Apache Spark, thanks to Twitter | 2014-10-20T00:00:00.000+0000 |
| ["Jeff Feng (Product Manager at Tableau Software)"] | ["Company Blog","Partners"] | <div class="post-meta">This post is guest authored by our friends at <a href="http://www.tableausoftware.com" target="_blank">Tableau Software</a>, whose visual analytics software is now <a href="http://www.databricks.com/certification" target="_blank">“Certified on Apache Spark.”</a></div> <hr /> <img class="aligncenter size-full wp-image-62" style="max-width: 100%; display: block; margin: 30px auto 5px auto;" src="https://databricks.com/wp-content/uploads/2014/10/Tableau-SparkSQL.png" alt="" align="middle" /> <h2>Apache Spark - The Next Big Innovation</h2> Once every few years or so, the big data open source community experiences a major innovation that advances the capabilities of data processing frameworks. For many years, MapReduce and the Hadoop open-source platform served as an effective foundation for the distributed processing of large data sets. Then last year, the introduction of YARN provided the resource manager needed to enable interactive workloads, bringing data proce... | arsalan | {"createdOn":"2014-10-15","publishedOn":"2014-10-15","tz":"UTC"} | null | 1773 | https://databricks.com/blog/2014/10/15/application-spotlight-tableau-software.html | application-spotlight-tableau-software | publish | Application Spotlight: Tableau Software | 2014-10-15T00:00:00.000+0000 |
| ["Scott Walent"] | ["Announcements","Company Blog","Events"] | The call for presentations for the inaugural <a href="http://spark-summit.org/east">Spark Summit East</a> is now open. Please join us in New York City on March 18-19, 2015 to share your experience with Apache Spark and celebrate its growing community. Spark Summit East is looking for presenters who would like to showcase how Spark and its related technologies are used in applications, development, data science and research. Please visit our <a href="http://www.spark-summit.org/east/2015/CFP">submission page</a> for additional details. The Deadline for submissions is December 5, 2014 at 11:59pm PST. Spark Summit East is the leading event for <a href="http://spark.apache.org">Apache Spark </a>users, developers and vendors. It is an exciting opportunity to meet analysts, researchers, developers and executives interested in utilizing Spark technology to answer big data questions. If you missed <a href="http://spark-summit.org/2014">Spark Summit 2014</a>, all the content is available onl... | scott | {"createdOn":"2014-10-23","publishedOn":"2014-10-23","tz":"UTC"} | null | 1809 | https://databricks.com/blog/2014/10/23/spark-summit-east-cfp-now-open.html | spark-summit-east-cfp-now-open | publish | Spark Summit East - CFP now open | 2014-10-23T00:00:00.000+0000 |
| ["Ari Himmel (CEO at Faimdata)","Nan Zhu (Chief Architect at Faimdata)"] | ["Company Blog","Partners"] | <div class="post-meta">This post is guest authored by our friends at <a href="http://www.faimdata.com" target="_blank">Faimdata</a>, whose Consumer Data Intelligence Service is now <a href="http://www.databricks.com/certification" target="_blank">“Certified on Apache Spark.”</a></div> <hr /> <h2>Forecasting, Analytics, Intelligence, Machine Learning</h2> Faimdata’s Consumer Data Intelligence Service is a turnkey Big Data solution that provides comprehensive infrastructure and applications to retailers. We help our clients form close connections with their customers and make timely business decisions, using their existing data sources. The unified data processing pipeline deployed by Faimdata has three core focuses. They are (i) our Personalization Service that identifies the personal preferences and buying behaviors of each individual consumer using recommendation/machine learning algorithms; (ii) our Data Analytic Workbench where clients execute high performance multi-dimensional an... | arsalan | {"createdOn":"2014-10-27","publishedOn":"2014-10-27","tz":"UTC"} | null | 1820 | https://databricks.com/blog/2014/10/27/application-spotlight-faimdata.html | application-spotlight-faimdata | publish | Application Spotlight: Faimdata | 2014-10-27T00:00:00.000+0000 |
| ["John Kreisa (VP of Strategic Marketing at Hortonworks)"] | ["Company Blog","Partners"] | <div class="post-meta">This post is guest authored by our friends at <a href="http://www.hortonworks.com" target="_blank">Hortonworks</a> announcing a broader partnership with Databricks around Apache Spark.</div> <hr> At Hortonworks we are very excited by the emerging use cases and potential of Apache Spark and Apache Hadoop. Spark is representative of just one of the shifts underway in the data landscape towards memory optimized processing, that when combined with Hadoop, can enable a new generation of applications. We are excited to announce that Hortonworks and Databricks have extended our partnership focus from providing a <a href="https://databricks.com/spark/certification/certified-spark-distribution" target="_blank">Certified Spark Distribution</a> to include a shared vision to further Apache Spark as an enterprise ready component of the Hortonworks Data Platform. We are closely aligned on a strategy and vision of bringing 100% open source software to market for the enterp... | arsalan | {"createdOn":"2014-10-31","publishedOn":"2014-10-31","tz":"UTC"} | null | 1823 | https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html | hortonworks-a-shared-vision-for-apache-spark-on-hadoop | publish | Hortonworks: A shared vision for Apache Spark on Hadoop | 2014-10-31T00:00:00.000+0000 |
| ["Sachin Chawla (VP of Engineering)"] | ["Company Blog","Partners"] | <div class="post-meta">This post is guest authored by our friends at <a href="http://www.skytree.net" target="_blank">Skytree</a>, whose Skytree Infinity platform is now <a href="http://www.databricks.com/certification" target="_blank">“Certified on Apache Spark.”</a></div> <hr /> <h2>To Infinity and Beyond - Big Data at the speed of light</h2> Astronomers were into Big Data before it was big. In order to learn about the history of the universe, they needed to observe and record billions and billions of astronomical objects and perform heavy-duty analysis on the resulting massive datasets. Available predictive methods were not scalable to the size of data sets they were dealing with so they turned to Skytree to obtain unprecedented performance and accuracy on the largest datasets ever collected. Fast-forward a decade or so and the need to store, access, process and analyze datasets of astronomical sizes is now mainstream in the guise of Big Data analytics. <a href="http://www.skytre... | john | {"createdOn":"2014-11-25","publishedOn":"2014-11-25","tz":"UTC"} | null | 1974 | https://databricks.com/blog/2014/11/24/application-spotlight-skytree-infinity.html | application-spotlight-skytree-infinity | publish | Application Spotlight: Skytree Infinity | 2014-11-25T00:00:00.000+0000 |
| ["Sonal Goyal (CEO)"] | ["Company Blog","Partners"] | <div class="post-meta">This post is guest authored by our friends at <a href="http://nubetech.co/" target="_blank">Nube Technologies</a>, whose Reifier platform is now <a href="http://www.databricks.com/certification" target="_blank">“Certified on Apache Spark.”</a></div> <hr /> <h2>About Nube Technologies</h2> Nube Technologies builds business applications to better decision making through better data. Nube’s fuzzy matching product Reifier helps companies get a holistic view of enterprise data. By linking and resolving entities across various sources, Reifier helps optimize the sales and marketing funnel, promotes enhanced security and risk management and better consolidation and reporting of business data. We help our customers build better and effective models by ensuring that their underlying master data is accurate. <h2>Why Apache Spark</h2> Data matching within a single source or across sources is a very core problem faced by almost every enterprise and we wanted to create a re... | john | {"createdOn":"2014-12-02","publishedOn":"2014-12-02","tz":"UTC"} | null | 2006 | https://databricks.com/blog/2014/12/02/application-spotlight-nube-reifier.html | application-spotlight-nube-reifier | publish | Application Spotlight: Nube Reifier | 2014-12-02T00:00:00.000+0000 |
| [" Dibyendu Bhattacharya (Big Data Architect)"] | ["Company Blog","Partners"] | <div class="post-meta">This is a guest blog post from our friends at Pearson outlining their Apache Spark use case.</div> <hr /> <h2>Introduction of Pearson</h2> Pearson is a British multinational publishing and education company headquartered in London. It is the largest education company and the largest book publisher in the world. Recently, Pearson announced a new organization structure in order to accelerate their push into digital learning, education services and emerging markets. I am part of Pearson Higher Education group, which provides textbooks and digital technologies to teachers and students across Higher Education. Pearson's higher education brands include eCollege, Mastering/MyLabs and Financial Times Publishing. <h2>What we wanted to do</h2> We are building a next generation adaptive learning platform which delivers immersive learning experiences designed for the way today’s students read, think, and learn. This learning platform is a scalable, reliable, cloud-based pl... | john | {"createdOn":"2014-12-09","publishedOn":"2014-12-09","tz":"UTC"} | null | 2027 | https://databricks.com/blog/2014/12/08/pearson-uses-spark-streaming-for-next-generation-adaptive-learning-platform.html | pearson-uses-spark-streaming-for-next-generation-adaptive-learning-platform | publish | Pearson uses Apache Spark Streaming for next generation adaptive learning platform | 2014-12-09T00:00:00.000+0000 |
| ["Reynold Xin"] | ["Apache Spark","Engineering Blog"] | A month ago, we shared with you our entry to the 2014 Gray Sort competition, a 3rd-party benchmark measuring how fast a system can sort 100 TB of data (1 trillion records). Today, we are happy to announce that our entry has been reviewed by the benchmark committee and we have officially won the <a href="http://sortbenchmark.org/">Daytona GraySort contest</a>! In case you missed our <a href="https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html">earlier blog post</a>, using Spark on 206 EC2 machines, we sorted 100 TB of data on disk in 23 minutes. In comparison, the previous world record set by Hadoop MapReduce used 2100 machines and took 72 minutes. This means that Apache Spark sorted the same data <strong>3X faster</strong> using <strong>10X fewer machines</strong>. All the sorting took place on disk (HDFS), without using Spark’s in-memory cache. This entry tied with a UCSD research team building high performance systems and we jointly set a new world record. <table class="... | rxin | {"createdOn":"2014-11-05","publishedOn":"2014-11-05","tz":"UTC"} | null | 2465 | https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html | spark-officially-sets-a-new-record-in-large-scale-sorting | publish | Apache Spark officially sets a new record in large-scale sorting | 2014-11-05T00:00:00.000+0000 |
| ["Matt MacKinnon (Director of Product Management at Zaloni)"] | ["Company Blog","Partners"] | <div class="post-meta">This post is guest authored by our friends at <a href="http://www.zaloni.com" target="_blank">Zaloni</a>, whose Bedrock platform is now <a href="http://www.databricks.com/certification" target="_blank">“Certified on Apache Spark.”</a></div> <hr /> <h2>Bedrock’s Managed Data Pipeline now includes Apache Spark</h2> It was evident from the all the buzz at the Strata + Hadoop World conference that Apache Spark has now shifted from the early adopter phase to establishing itself as an integral and permanent part of the Hadoop ecosystem. The rapid pace of adoption is impressive! Given the entrance of Spark into the mainstream Hadoop world, we are glad to announce that Bedrock is now officially Certified on Spark. <h2>How does Spark enhance Bedrock?</h2> Bedrock™ defines a Managed Data Pipeline as consisting of Ingest, Organize, and Prepare stages. Bedrock’s strength lies in the integrated nature of the way data is handled through these stages. ● Ingest: Bring data fr... | john | {"createdOn":"2014-11-14","publishedOn":"2014-11-14","tz":"UTC"} | null | 2466 | https://databricks.com/blog/2014/11/14/application-spotlight-bedrock.html | application-spotlight-bedrock | publish | Application Spotlight: Bedrock | 2014-11-14T00:00:00.000+0000 |
| ["John Tripier","Paco Nathan"] | ["Announcements","Company Blog"] | More and more companies are using Apache Spark, and many Spark based pilots are currently deploying in production. In social media, at every big data conference or meetup, people describe new POC, prototypes, and production deployments using Spark. Behind this momentum, a growing need for Spark developers is developing; people who have demonstrated expertise in how to implement best practices for Spark. People who can help the enterprise building increasingly complex and sophisticated solutions on top of their Spark deployments. At Databricks, we get contacted by many enterprises looking for Spark resources to help with their next data-driven initiative. And so beyond our effort to train people on Spark directly or through partners all around the world, we have teamed up with O’Reilly for offering the first industry standard for measuring and validating a developer’s expertise on Spark. <h2>Benefits of being a Spark Certified Developer</h2> The Spark Developer Certification is the wa... | john | {"createdOn":"2014-11-15","publishedOn":"2014-11-15","tz":"UTC"} | null | 2467 | https://databricks.com/blog/2014/11/14/the-spark-certified-developer-program.html | the-spark-certified-developer-program | publish | The Apache Spark Certified Developer Program | 2014-11-15T00:00:00.000+0000 |
| ["Luis Quintela (Sr. Manager of Big Data Analytics)","Yan Breek (Data Scientist)","Girish Kathalagiri (Data Analytics Engineer)"] | ["Company Blog","Partners"] | <div class="post-meta">This is a guest blog post from our friends at Samsung SDS outlining their Apache Spark use case.</div> <hr /> <h2>Business Challenge</h2> Samsung SDS is the business and IT solutions arm of Samsung Group. A global ICT service provider with over 17,000 employees worldwide and 6.7 billion USD in revenues, Samsung SDS tackles the challenges of some of the largest global enterprises in such industries as manufacturing, financial services, health care and retail. In the different areas Samsung is focused on, the ability to make timely decisions that maximize the value to a business becomes critical. Prescriptive analytics methods have been used effectively to support decision making by leveraging probable future outcomes determined by predictive models and suggesting actions that provide maximal business value. One of the main challenges in applying prescriptive analytics in these areas is the need to analyze a combination of structured and unstructured data at la... | john | {"createdOn":"2014-11-22","publishedOn":"2014-11-22","tz":"UTC"} | null | 2468 | https://databricks.com/blog/2014/11/21/samsung-sds-uses-spark-for-prescriptive-analytics-at-large-scale.html | samsung-sds-uses-spark-for-prescriptive-analytics-at-large-scale | publish | Samsung SDS uses Apache Spark for prescriptive analytics at large scale | 2014-11-22T00:00:00.000+0000 |
| ["Ameet Talwalkar","Anthony Joseph"] | ["Announcements","Company Blog"] | In the age of ‘Big Data,’ with datasets rapidly growing in size and complexity and cloud computing becoming more pervasive, data science techniques are fast becoming core components of large-scale data processing pipelines. Apache Spark offers analysts and engineers a powerful tool for building these pipelines, and learning to build such pipelines will soon be a lot easier. Databricks is excited to be working with professors from University of California Berkeley and University of California Los Angeles to produce two new upcoming Massive Open Online Courses (MOOCs). Both courses will be freely available on the edX MOOC platform in <del>spring</del> summer 2015. edX Verified Certificates are also available for a fee. <img class="aligncenter size-full wp-image-62" style="max-width: 100%; display: block; margin: 30px auto 5px auto;" src="https://databricks.com/wp-content/uploads/2014/12/MOOC1.png" alt="" align="middle" /> The first course, called <a href="https://www.edx.org/course/uc... | arsalan | {"createdOn":"2014-12-02","publishedOn":"2014-12-02","tz":"UTC"} | null | 2469 | https://databricks.com/blog/2014/12/02/announcing-two-spark-based-moocs.html | announcing-two-spark-based-moocs | publish | Databricks to run two massive online courses on Apache Spark | 2014-12-02T00:00:00.000+0000 |
| ["Lieven Gesquiere (Virdata Lead Core R&D)"] | ["Company Blog","Partners"] | <div class="post-meta">This post is guest authored by our friends at <a href="http://www.technicolor.com/" target="_blank">Technicolor</a>, whose Virdata platform is now <a href="http://www.databricks.com/certification" target="_blank">“Certified on Apache Spark.”</a></div> <hr /> <h2>About Virdata</h2> Virdata is Technicolor’s cloud-native Internet of Things platform offering real-time monitoring, configuration and management of the unprecedented number of connected devices and applications. Combining its highly-scalable data ingestion and messaging capabilities with real-time and historical analytics, Virdata brings value across multiple data-driven markets. The Virdata platform was launched at CES Las Vegas in January, 2014. The Virdata cloud-based platform architecture integrates state-of-the-art open source software components into a homogeneous, high-availability data-processing environment. <h2>Virdata and Apache Spark</h2> The Virdata solution architecture comprises 3 areas:... | john | {"createdOn":"2014-12-04","publishedOn":"2014-12-04","tz":"UTC"} | null | 2470 | https://databricks.com/blog/2014/12/03/application-spotlight-technicolor-virdata-internet-of-things-platform.html | application-spotlight-technicolor-virdata-internet-of-things-platform | publish | Application Spotlight: Technicolor Virdata Internet of Things platform | 2014-12-04T00:00:00.000+0000 |
| ["by Databricks Press Office"] | ["Announcements","Company Blog"] | <strong>Highlights:</strong> <ul> <li>Databricks Expands Bay Area Presence, Moves HQ to San Francisco</li> <li>Company Names Kavitha Mariappan as Marketing Vice President</li> </ul> Press Release: <a title="http://finance.yahoo.com/news/databricks-expands-bay-area-presence-140000610.html" href="http://finance.yahoo.com/news/databricks-expands-bay-area-presence-140000610.html">http://finance.yahoo.com/news/databricks-expands-bay-area-presence-140000610.html</a> <strong>San Francisco, Calif. – January 13, 2015 – </strong><a href="http://www.databricks.com">Databricks</a>, the company founded by the creators of the popular open-source Big Data processing engine Apache Spark with its flagship product, Databricks Cloud, today announced the relocation of their headquarters to San Francisco from Berkeley, California. The expansion is a reflection of Databricks’ growth heading into 2015. The company grew more than 200 percent in headcount over the last year and adds talent to its executive ... | kavitha | {"createdOn":"2015-01-13","publishedOn":"2015-01-13","tz":"UTC"} | null | 2294 | https://databricks.com/blog/2015/01/13/databricks-expands-bay-area-presence-moves-hq-to-san-francisco.html | databricks-expands-bay-area-presence-moves-hq-to-san-francisco | publish | Databricks Expands Bay Area Presence, Moves HQ to San Francisco | 2015-01-13T00:00:00.000+0000 |
| ["Kavitha Mariappan"] | ["Announcements","Company Blog"] | Complementing our on-going direct and partner-led Apache Spark training efforts, Databricks has teamed up with O’Reilly to offer the industry’s first standard for measuring and validating a developer’s expertise with Spark. Databricks and O’Reilly are proud to announce the online availability of the Spark Certified Developer exams. You can now sign up and take the exam online<a href=" http://go.databricks.com/spark-certified-developer"> here</a>. <b>What is the Spark Certified Developer program?</b> Apache Spark is the most active project in the Big Data ecosystem and is fast becoming the open source alternative of choice for many enterprises. Spark provides enterprises with the scale and sophistication they require to gain insights from their Big Data by providing a unified framework for building data pipelines. Databricks was founded by the team that created and continues to lead both development and training around Spark, and<a href="https://databricks.com/product"> Databricks Cl... | kavitha | {"createdOn":"2015-01-16","publishedOn":"2015-01-16","tz":"UTC"} | null | 2345 | https://databricks.com/blog/2015/01/16/spark-certified-developer-exams-available-online.html | spark-certified-developer-exams-available-online | publish | Apache Spark Certified Developer exams available online! | 2015-01-16T00:00:00.000+0000 |
| ["Kavitha Mariappan"] | ["Company Blog","Events"] | We are thrilled to announce the availability of the <a href="http://go.spark-summit.org/e1t/c/*W6stDzJ6_3DYhW6Y-qp35L8r5j0/*W4PZ7v36VwsQzW58WPXZ57MJJH0/5/f18dQhb0Sq5z8YHrDTW8HLj0x5VQHw7W6bFhBV6P7FhxW4R4BZM57mvC2W1BQYgg4P0TLvW85Q81T83G7d1W9dtj1h7NQNCqW4zWTRG33K-8nW7NMj-x9bTNXYW954KlM4P0Yt6W2d4hSK3bWrh8W2YH1kR47xfHKW2HRyfR6trFPNW47YlYy4bfcHbW47Xx4z3C811XW4-SZvb2KQ2YYW3_VZwP5ThdHgW3s1XjF51G0BJW4Zh8Y-57-WqMW3H_Pty2DzCtRW1zBkSq1sQ3b4W8V-D1g5rcXhJW7JS0c27BQjYmVJB4Mm896Q7XW94B_1g7v78c8W8NqNPC5qWyC0W7JTtyJ2Xm03sW3FBZ5D9lNHw9W6_b40v3vyNkPW6J4Ypk8lBfs0W3bnqM_1C-9rFVL--5_1Pct9JW2mPjk95hqX5PW9lKhck4H6s3gN4m21WR6Q977Vb98_P6s16_2W8Ph58-59BvQ0W7y34GD1FmQY-W7r71Hq2PhWHMW7tprCG95RqNQW2j-Sgt2L5GhqW3G6xft6TMH99W6-cC_w3wXTtZW6Sytzy9fTwQmN3FYx-Q_HpmRf6dY7D511" target="_blank">agenda</a> for Spark Summit East 2015! This inaugural New York City event on <span class="aBn" tabindex="0" data-term="goog_929332804"><span class="aQJ">March 18-19, 2015</span></span> has over thirty jam-packed sessions – offering a ... | kavitha | {"createdOn":"2015-01-20","publishedOn":"2015-01-20","tz":"UTC"} | null | 2359 | https://databricks.com/blog/2015/01/20/spark-summit-east-2015-agenda-is-now-available.html | spark-summit-east-2015-agenda-is-now-available | publish | Spark Summit East 2015 Agenda is Now Available | 2015-01-20T00:00:00.000+0000 |
| ["Yin Huai (Databricks)"] | ["Apache Spark","Engineering Blog"] | [sidenote]Note: Starting Spark 1.3, SchemaRDD will be renamed to DataFrame.[/sidenote] <hr /> In this blog post, we introduce Spark SQL’s JSON support, a feature we have been working on at Databricks to make it dramatically easier to query and create JSON data in Spark. With the prevalence of web and mobile applications, JSON has become the de-facto interchange format for web service API’s as well as long-term storage. With existing tools, users often engineer complex pipelines to read and write JSON data sets within analytical systems. Spark SQL’s JSON support, released in Apache Spark 1.1 and enhanced in Apache Spark 1.2, vastly simplifies the end-to-end-experience of working with JSON data.<!--more--> <h2>Existing practices</h2> In practice, users often face difficulty in manipulating JSON data with modern analytical systems. To write a dataset to JSON format, users first need to write logic to convert their data to JSON. To read and query JSON datasets, a common practice is to us... | michael | {"createdOn":"2015-02-02","publishedOn":"2015-02-02","tz":"UTC"} | null | 2376 | https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html | an-introduction-to-json-support-in-spark-sql | publish | An introduction to JSON support in Spark SQL | 2015-02-02T00:00:00.000+0000 |
| ["Jeremy Freeman (Howard Hughes Medical Institute)"] | ["Apache Spark","Engineering Blog","Streaming"] | Many real world data are acquired sequentially over time, whether messages from social media users, time series from wearable sensors, or — in a case we are particularly excited about — the firing of large populations of neurons. In these settings, rather than wait for all the data to be acquired before performing our analyses, we can use streaming algorithms to identify patterns over time, and make more targeted predictions and decisions. One simple strategy is to build machine learning models on static data, and then use the learned model to make predictions on an incoming data stream. But what if the patterns in the data are themselves dynamic? That's where streaming algorithms come in. A key advantage of Apache Spark is that its machine learning library (MLlib) and its library for stream processing (Spark Streaming) are built on the same core architecture for distributed analytics. This facilitates adding extensions that leverage and combine components in novel ways without reinv... | Xiangrui | {"createdOn":"2015-01-28","publishedOn":"2015-01-28","tz":"UTC"} | null | 2382 | https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html | introducing-streaming-k-means-in-spark-1-2 | publish | Introducing streaming k-means in Apache Spark 1.2 | 2015-01-28T00:00:00.000+0000 |
| ["Dave Wang (Databricks)"] | ["Announcements","Company Blog"] | Recently <a href="http://www.infoworld.com/article/2871935/application-development/infoworlds-2015-technology-of-the-year-award-winners.html" target="_blank">Infoworld unveiled the 2015 Technology of the Year Award winners</a>, which range from open source software to stellar consumer technologies like the iPhone. Being the <a title="Announcing Spark 1.2" href="https://databricks.com/blog/2014/12/19/announcing-spark-1-2.html" target="_blank">creators behind Apache Spark</a>, Databricks is thrilled to see Spark in their ranks. In fact, we built our flagship product, <a title="Databricks Cloud Overview" href="https://databricks.com/product">Databricks</a>, on top of Spark with the ambition to revolutionize big data processing in ways similar to how iPhone revolutionized the mobile experience. The iPhone was revolutionary in a number of ways: first, it integrated a disparate set of consumer electronic capabilities such as mobile phone, camera, GPS, and even laptop; second, it created a... | dave_wang | {"createdOn":"2015-02-05","publishedOn":"2015-02-05","tz":"UTC"} | null | 2454 | https://databricks.com/blog/2015/02/05/apache-spark-selected-for-infoworld-2015-technology-of-the-year-award.html | apache-spark-selected-for-infoworld-2015-technology-of-the-year-award | publish | Apache Spark selected for Infoworld 2015 Technology of the Year Award | 2015-02-05T00:00:00.000+0000 |
| ["Patrick Wendell"] | ["Apache Spark","Engineering Blog"] | We at Databricks are thrilled to announce the release of Apache Spark 1.2! Apache Spark 1.2 introduces many new features along with scalability, usability and performance improvements. This post will introduce some key features of Apache Spark 1.2 and provide context on the priorities of Spark for this and the next release. In the next two weeks, we’ll be publishing blog posts with more details on feature additions in each of the major components. Apache Spark 1.2 has been posted today on the <a href="http://spark.apache.org/releases/spark-release-1-2-0.html">Apache Spark website</a>. Learn more about specific new features in related in-depth posts: <ul> <li><a title="Spark SQL Data Sources API: Unified Data Access for the Spark Platform" href="https://databricks.com/blog/2015/01/09/spark-sql-data-sources-api-unified-data-access-for-the-spark-platform.html" target="_blank">Spark SQL data sources API</a></li> <li><a title="An introduction to JSON support in Spark SQL" href="https:/... | patrick | {"createdOn":"2014-12-19","publishedOn":"2014-12-19","tz":"UTC"} | null | 2471 | https://databricks.com/blog/2014/12/19/announcing-spark-1-2.html | announcing-spark-1-2 | publish | Announcing Apache Spark 1.2 | 2014-12-19T00:00:00.000+0000 |
| ["Xiangrui Meng","Patrick Wendell"] | ["Apache Spark","Ecosystem","Engineering Blog"] | Today, we are happy to announce <em>Apache Spark Packages</em> (<a title="http://spark-packages.org" href="http://spark-packages.org">http://spark-packages.org</a>), a community package index to track the growing number of open source packages and libraries that work with Apache Spark. <em>Spark Packages</em> makes it easy for users to find, discuss, rate, and install packages for any version of Spark, and makes it easy for developers to contribute packages. <!--more--> <em>Spark Packages</em> will feature integrations with various data sources, management tools, higher level domain-specific libraries, machine learning algorithms, code samples, and other Spark content. Thanks to the package authors, the initial listing of packages includes <a href="http://spark-packages.org/package/6">scientific computing libraries</a>, a <a href="http://spark-packages.org/package/10">job execution server</a>, a connector for <a href="http://spark-packages.org/package/3">importing Avro data</a>, tool... | Xiangrui | {"createdOn":"2014-12-22","publishedOn":"2014-12-22","tz":"UTC"} | null | 2472 | https://databricks.com/blog/2014/12/22/announcing-spark-packages.html | announcing-spark-packages | publish | Announcing Apache Spark Packages | 2014-12-22T00:00:00.000+0000 |
| ["Xiangrui Meng","Joseph Bradley","Evan Sparks (UC Berkeley)","Shivaram Venkataraman (UC Berkeley)"] | ["Engineering Blog","Machine Learning"] | MLlib’s goal is to make practical machine learning (ML) scalable and easy. Besides new algorithms and performance improvements that we have seen in each release, a great deal of time and effort has been spent on making MLlib <i>easy</i>. Similar to Spark Core, MLlib provides APIs in three languages: Python, Java, and Scala, along with user guide and example code, to ease the learning curve for users coming from different backgrounds. In Apache Spark 1.2, Databricks, jointly with AMPLab, UC Berkeley, continues this effort by introducing a pipeline API to MLlib for easy creation and tuning of practical ML pipelines. A practical ML pipeline often involves a sequence of data pre-processing, feature extraction, model fitting, and validation stages. For example, classifying text documents might involve text segmentation and cleaning, extracting features, and training a classification model with cross-validation. Though there are many libraries we can use for each stage, connecting the dots ... | Xiangrui | {"createdOn":"2015-01-07","publishedOn":"2015-01-07","tz":"UTC"} | null | 2473 | https://databricks.com/blog/2015/01/07/ml-pipelines-a-new-high-level-api-for-mllib.html | ml-pipelines-a-new-high-level-api-for-mllib | publish | ML Pipelines: A New High-Level API for MLlib | 2015-01-07T00:00:00.000+0000 |
| ["Michael Armbrust"] | ["Apache Spark","Engineering Blog"] | Since the inception of Spark SQL in Apache Spark 1.0, one of its most popular uses has been as a conduit for pulling data into the Spark platform. Early users loved Spark SQL’s support for reading data from existing Apache Hive tables as well as from the popular Parquet columnar format. We’ve since added support for other formats, such as <a href="https://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets">JSON</a>. In Apache Spark 1.2, we've taken the next step to allow Spark to integrate natively with a far larger number of input sources. These new integrations are made possible through the inclusion of the new Spark SQL Data Sources API. <a href="https://databricks.com/wp-content/uploads/2015/01/DataSourcesApiDiagram.png"><img class="wp-image-2372 aligncenter" src="https://databricks.com/wp-content/uploads/2015/01/DataSourcesApiDiagram-1024x526.png" alt="DataSourcesApiDiagram" width="516" height="265" /></a> The Data Sources API provides a pluggable mechanism... | michael | {"createdOn":"2015-01-09","publishedOn":"2015-01-09","tz":"UTC"} | null | 2474 | https://databricks.com/blog/2015/01/09/spark-sql-data-sources-api-unified-data-access-for-the-spark-platform.html | spark-sql-data-sources-api-unified-data-access-for-the-spark-platform | publish | Spark SQL Data Sources API: Unified Data Access for the Apache Spark Platform | 2015-01-09T00:00:00.000+0000 |
| ["Joseph K. Bradley (Databricks)","Manish Amde (Origami Logic)"] | ["Apache Spark","Engineering Blog","Machine Learning"] | <div class="post-meta">This is a post written together with Manish Amde from <a href="http://www.origamilogic.com/">Origami Logic</a>.</div> <hr /> Apache Spark 1.2 introduces <a href="http://en.wikipedia.org/wiki/Random_forest">Random Forests</a> and <a href="http://en.wikipedia.org/wiki/Gradient_boosting#Gradient_tree_boosting">Gradient-Boosted Trees (GBTs)</a> into MLlib. Suitable for both classification and regression, they are among the most successful and widely deployed machine learning methods. Random Forests and GBTs are <i>ensemble learning algorithms</i>, which combine multiple decision trees to produce even more powerful models. In this post, we describe these models and the distributed implementation in MLlib. We also present simple examples and provide pointers on how to get started. <h2>Ensemble Methods</h2> Simply put, <a href="http://en.wikipedia.org/wiki/Ensemble_learning">ensemble learning algorithms</a> build upon other machine learning methods by combining models... | joseph | {"createdOn":"2015-01-21","publishedOn":"2015-01-21","tz":"UTC"} | null | 2475 | https://databricks.com/blog/2015/01/21/random-forests-and-boosting-in-mllib.html | random-forests-and-boosting-in-mllib | publish | Random Forests and Boosting in MLlib | 2015-01-21T00:00:00.000+0000 |
| ["Tathagata Das"] | ["Apache Spark","Engineering Blog","Streaming"] | Real-time stream processing systems must be operational 24/7, which requires them to recover from all kinds of failures in the system. Since its beginning, Apache Spark Streaming has included support for recovering from failures of both driver and worker machines. However, for some data sources, input data could get lost while recovering from the failures. In Apache Spark 1.2, we have added preliminary support for write ahead logs (also known as journaling) to Spark Streaming to improve this recovery mechanism and give stronger guarantees of zero data loss for more data sources. In this blog, we are going to elaborate on how this feature works and how developers can enable it to get those guarantees in Spark Streaming applications. <h2>Background</h2> Spark and its RDD abstraction is designed to seamlessly handle failures of any worker nodes in the cluster. Since Spark Streaming is built on Spark, it enjoys the same fault-tolerance for worker nodes. However, the demand of high uptimes ... | tdas | {"createdOn":"2015-01-15","publishedOn":"2015-01-15","tz":"UTC"} | null | 2476 | https://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-and-zero-data-loss-in-spark-streaming.html | improved-driver-fault-tolerance-and-zero-data-loss-in-spark-streaming | publish | Improved Fault-tolerance and Zero Data Loss in Apache Spark Streaming | 2015-01-15T00:00:00.000+0000 |
| ["Kavitha Mariappan"] | ["Announcements","Company Blog"] | In partnership with <a href="https://typesafe.com/">Typesafe</a>, we are excited to see the publication of the <a href="http://info.typesafe.com/COLL-20XX-Spark-Survey-Report_LP.html?lst=PR&lsd=COLL-20XX-Spark-Survey-Trends-Adoption-Report">survey report</a> representing the largest poll of Apache Spark developers to date. Spark is currently the most active open source project in big data and has been rapidly gaining traction over the past few years. This survey of over 2100 respondents further validates the wide variety of use cases and environments where it is being deployed. The survey results indicate that 13% are already using Spark in production environments with 20% of the respondents with plans to deploy Spark in production environments in 2015, and 31% are currently in the process of evaluating it. In total, the survey covers over 500 enterprises that are using or planning to use Spark in production environments ranging from on-premise Hadoop clusters to public clouds, wi... | kavitha | {"createdOn":"2015-01-27","publishedOn":"2015-01-27","tz":"UTC"} | null | 2477 | https://databricks.com/blog/2015/01/27/big-data-projects-are-hungry-for-simpler-and-more-powerful-tools-survey-validates-apache-spark-is-gaining-developer-traction.html | big-data-projects-are-hungry-for-simpler-and-more-powerful-tools-survey-validates-apache-spark-is-gaining-developer-traction | publish | Big data projects are hungry for simpler and more powerful tools: Survey validates Apache Spark is gaining developer traction! | 2015-01-27T00:00:00.000+0000 |
| ["Holden Karau","Andy Konwinski","Patrick Wendell","Matei Zaharia"] | ["Announcements","Company Blog"] | <a href="https://databricks.com/wp-content/uploads/2015/02/large-oreilly-book-cover.jpg"><img class="size-medium wp-image-2486 aligncenter" src="https://databricks.com/wp-content/uploads/2015/02/large-oreilly-book-cover-228x300.jpg" alt="large oreilly book cover" width="228" height="300" /></a> Today we are happy to announce that the complete <a href="http://shop.oreilly.com/product/0636920028512.do" target="_blank"><i>Learning Spark</i></a> book is available from O’Reilly in e-book form with the print copy expected to be available February 16th. At Databricks, as the creators behind Apache Spark, we have witnessed <a title="Big data projects are hungry for simpler and more powerful tools: Survey validates Apache Spark is gaining developer traction!" href="https://databricks.com/blog/2015/01/27/big-data-projects-are-hungry-for-simpler-and-more-powerful-tools-survey-validates-apache-spark-is-gaining-developer-traction.html" target="_blank">explosive growth in the interest and adoption ... | patrick | {"createdOn":"2015-02-09","publishedOn":"2015-02-09","tz":"UTC"} | null | 2479 | https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html | learning-spark-book-available-from-oreilly | publish | "Learning Spark" book available from O'Reilly | 2015-02-09T00:00:00.000+0000 |
| null | ["Announcements","Company Blog","Customers"] | We're really excited to share that <a href="http://www.automatic.com">Automatic Labs </a>has selected Databricks as its preferred big data processing platform. Press release: <a href="http://www.marketwired.com/press-release/automatic-labs-turns-databricks-cloud-faster-innovation-dramatic-cost-savings-1991316.htm" target="_blank">http://www.marketwired.com/press-release/automatic-labs-turns-databricks-cloud-faster-innovation-dramatic-cost-savings-1991316.htm</a> Automatic Labs needed to run large and complex queries against their entire data set to explore and come up with new product ideas. Their prior solution using Postgres impeded the ability of Automatic’s team to efficiently explore data because queries took days to run and data could not be easily visualized, preventing Automatic Labs from bringing critical new products to market. They then deployed Databricks, our simple yet powerful unified big data processing platform on Amazon Web Services (AWS) and realized these key bene... | kavitha | {"createdOn":"2015-02-13","publishedOn":"2015-02-13","tz":"UTC"} | null | 2566 | https://databricks.com/blog/2015/02/12/automatic-labs-selects-databricks-cloud-for-primary-real-time-data-processing.html | automatic-labs-selects-databricks-cloud-for-primary-real-time-data-processing | publish | Automatic Labs Selects Databricks for Primary Real-Time Data Processing | 2015-02-13T00:00:00.000+0000 |
| null | ["Apache Spark","Engineering Blog"] | 2014 has been a year of <a href="https://databricks.com/blog/2015/01/27/big-data-projects-are-hungry-for-simpler-and-more-powerful-tools-survey-validates-apache-spark-is-gaining-developer-traction.html">tremendous growth</a> for Apache Spark. It became the most active open source project in the Big Data ecosystem with over 400 contributors, and was adopted by many platform vendors - including all of the major Hadoop distributors. Through our ecosystem of products, partners, and training at Databricks, we also saw over 200 enterprises deploying Spark in production. To help Spark achieve this growth, Databricks has worked broadly throughout the project to improve functionality and ease of use. Indeed, while the community has grown a lot, about 75% of the code added to Spark last year came from Databricks. In this post, we would like to highlight some of the additions we made to Spark in 2014, and provide a preview of our priorities in 2015. In general, our approach to developing Spar... | patrick | {"createdOn":"2015-02-14","publishedOn":"2015-02-14","tz":"UTC"} | Spark: A review of 2014 and looking ahead to 2015 priorities | 2576 | https://databricks.com/blog/2015/02/13/spark-a-review-of-2014-and-looking-ahead-to-2015-priorities.html | spark-a-review-of-2014-and-looking-ahead-to-2015-priorities | publish | Apache Spark: A review of 2014 and looking ahead to 2015 priorities | 2015-02-14T00:00:00.000+0000 |
| null | ["Company Blog","Partners"] | This is a guest blog from our one of our partners: <a href="http://www.memsql.com/" target="_blank">MemSQL</a> <hr /> <h2>Summary</h2> Coupling operational data with the most advanced analytics puts data-driven business ahead. The MemSQL Apache Spark Connector enables such configurations. <h2>Meeting Transactional and Analytical Needs</h2> Transactional databases form the core of modern business operations. Whether that transaction is financial, physical in terms of inventory changes, or experiential in terms of a customer engagement, the transaction itself moves our business forward. But while transactions represent the state of our business, analytics tell us patterns of the past, and help us predict patterns of the future. Analytics can tell us what levers influence profitability and put us ahead of the pack. Success in digital business requires both transactional and analytical prowess, including the foremost means to analyze data. <h2>Speed and Agility with MemSQL and A... | dave_wang | {"createdOn":"2015-02-19","publishedOn":"2015-02-19","tz":"UTC"} | null | 2749 | https://databricks.com/blog/2015/02/19/extending-memsql-analytics-with-spark.html | extending-memsql-analytics-with-spark | publish | Extending MemSQL Analytics with Apache Spark | 2015-02-19T00:00:00.000+0000 |
| null | ["Apache Spark","Engineering Blog"] | Today, we are excited to announce a new DataFrame API designed to make big data processing even easier for a wider audience. When we first open sourced Apache Spark, we aimed to provide a simple API for distributed data processing in general-purpose programming languages (Java, Python, Scala). Spark enabled distributed data processing through functional transformations on distributed collections of data (RDDs). This was an incredibly powerful API: tasks that used to take thousands of lines of code to express could be reduced to dozens. As Spark continues to grow, we want to enable wider audiences beyond “Big Data” engineers to leverage the power of distributed processing. The new DataFrames API was created with this goal in mind. This API is inspired by data frames in R and Python (Pandas), but designed from the ground-up to support modern big data and data science applications. As an extension to the existing RDD API, DataFrames feature: <ul> <li>Ability to scale from kilobytes o... | rxin | {"createdOn":"2015-02-17","publishedOn":"2015-02-17","tz":"UTC"} | null | 2757 | https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html | introducing-dataframes-in-spark-for-large-scale-data-science | publish | Introducing DataFrames in Apache Spark for Large Scale Data Science | 2015-02-17T00:00:00.000+0000 |
| null | ["Company Blog","Events"] | The Strata + Hadoop World Conference in San Jose last week was abuzz with "putting data to work" in keeping with this year's conference theme. This was a significant shift from last year's event where organizations were highly focused on getting their arms around their big data projects and being steeped in evaluating the multitude of tools of new technologies available. Last week's event highlighted what is top of mind for enterprises and developers alike - how to turn their big data initiatives and projects into real business results? One theme was loud and clear - Apache Spark's flame shone bright! Derrick Harris from GigaOM summed this up aptly in his article "<a href="https://gigaom.com/2015/02/20/for-now-spark-looks-like-the-future-of-big-data/" target="_blank">For now, Spark looks like the future of big data</a>". To quote Derrick, <em>"Titles can be misleading. For example, the O’Reilly Strata + Hadoop World conference took place in San Jose, California, this week but Hadoop ... | dave_wang | {"createdOn":"2015-02-24","publishedOn":"2015-02-24","tz":"UTC"} | null | 2830 | https://databricks.com/blog/2015/02/24/databricks-at-strata-san-jose.html | databricks-at-strata-san-jose | publish | Databricks at Strata San Jose | 2015-02-24T00:00:00.000+0000 |
| null | ["Company Blog","Product"] | <div class="article-body"> Enterprises have been collecting ever-larger amounts of data with the goal of extracting insights and creating value. Yet despite a few innovative companies who are able to successfully exploit big data, the promised returns of big data remain elusive beyond the grasp of many enterprises. One notable and rapidly growing open source technology that has emerged in the big data space is Apache Spark. Spark is an open source data processing framework that was built for speed, ease of use, and scale. Much of its benefits are due to how it unifies critical data analytics capabilities such as SQL, machine learning and streaming in a single framework. This enables enterprises to simultaneously achieve high performance computing at scale while simplifying their data processing infrastructure by avoiding the difficult integration of many disparate and difficult tools with a single powerful yet simple alternative. While Spark appears to have the potential to solve m... | kavitha | {"createdOn":"2015-03-04","publishedOn":"2015-03-04","tz":"UTC"} | null | 2871 | https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html | databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant | publish | Databricks: From raw data, to insights and data products in an instant! | 2015-03-04T00:00:00.000+0000 |
| authors | categories | content | creator | dates | description | id | link | slug | status | title | publishedOn |
|---|
Showing the first 152 rows.
databricksBlog2DF.printSchema()
%md-sandbox Since the dates are represented by a `timestamp` data type, we need to convert them to a `date` data type in order to run `<`- and `>`-style comparisons when querying for articles within certain date ranges (such as a list of all articles published in 2013). This is accomplished by using the `to_date` function in Scala or Python. <img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> See the Spark documentation on <a href="https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$" target="_blank">built-in functions</a> for a long list of date-specific functions.
from pyspark.sql.functions import to_timestamp, year, col

resultDF = (databricksBlog2DF
  .select("title", to_timestamp(col("publishedOn"), "MMM dd, yyyy").alias("date"), "link")
  .filter(year(col("publishedOn")) == "2013")
  .orderBy(col("publishedOn")))

display(resultDF)
| Databricks and the Apache Spark Platform | 2013-10-27T00:00:00.000+0000 | https://databricks.com/blog/2013/10/27/databricks-and-the-apache-spark-platform.html |
| The Growing Apache Spark Community | 2013-10-28T00:00:00.000+0000 | https://databricks.com/blog/2013/10/27/the-growing-spark-community.html |
| Databricks and Cloudera Partner to Support Apache Spark | 2013-10-29T00:00:00.000+0000 | https://databricks.com/blog/2013/10/28/databricks-and-cloudera-partner-to-support-spark.html |
| Putting Apache Spark to Use: Fast In-Memory Computing for Your Big Data Applications | 2013-11-22T00:00:00.000+0000 | https://databricks.com/blog/2013/11/21/putting-spark-to-use.html |
| Highlights From Spark Summit 2013 | 2013-12-19T00:00:00.000+0000 | https://databricks.com/blog/2013/12/18/spark-summit-2013-follow-up.html |
| Apache Spark 0.8.1 Released | 2013-12-20T00:00:00.000+0000 | https://databricks.com/blog/2013/12/19/release-0_8_1.html |
| title | date | link |
|---|
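As a pure-Python analogy (not Spark code), the same year-based filtering can be sketched with the standard library's `datetime`. The titles and timestamps below are taken from the query results above:

```python
from datetime import datetime

# Parse an ISO-8601 timestamp string, then filter by year, mirroring
# the to_date / year(col(...)) logic in the Spark query above.
posts = [
    ("Databricks and the Apache Spark Platform", "2013-10-27T00:00:00.000+0000"),
    ("Announcing Apache Spark 1.2", "2014-12-19T00:00:00.000+0000"),
]

def published_year(ts: str) -> int:
    # Strip the ".000+0000" suffix so strptime can parse the timestamp.
    return datetime.strptime(ts[:19], "%Y-%m-%dT%H:%M:%S").year

posts_2013 = [title for title, ts in posts if published_year(ts) == 2013]
print(posts_2013)  # ['Databricks and the Apache Spark Platform']
```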
%md ## Array Data The DataFrame also contains array columns. Easily determine the size of each array using the built-in `size(..)` function with array columns.
from pyspark.sql.functions import size

display(databricksBlogDF.select(size("authors"), "authors"))
| 1 | ["Tomer Shiran (VP of Product Management at MapR)"] |
| 1 | ["Tathagata Das"] |
| 1 | ["Steven Hillion"] |
| 2 | ["Michael Armbrust","Reynold Xin"] |
| 1 | ["Patrick Wendell"] |
| 2 | ["Ali Ghodsi","Ahir Reddy"] |
| 2 | ["Russell Cardullo (Data Infrastructure Engineer at Sharethrough)","Michael Ruggiero (Data Infrastructure Engineer at Sharethrough)"] |
| 2 | ["Jai Ranganathan","Matei Zaharia"] |
| 1 | ["Databricks Press Office"] |
| 1 | ["Ion Stoica"] |
| 2 | ["Ahir Reddy","Reynold Xin"] |
| 1 | ["Pat McDonough"] |
| 1 | ["Ion Stoica"] |
| 1 | ["Patrick Wendell"] |
| 1 | ["Andy Konwinski"] |
| 1 | ["Pat McDonough"] |
| 1 | ["Ion Stoica"] |
| 1 | ["Matei Zaharia"] |
| 2 | ["Ion Stoica","Matei Zaharia"] |
| 1 | ["Arsalan Tavakoli-Shiraji"] |
| 2 | ["Prashant Sharma","Matei Zaharia"] |
| 1 | ["Databricks Training Team"] |
| 1 | ["Claudiu Barbura (Sr. Dir. of Engineering at Atigeo LLC)"] |
| 1 | ["Sarabjeet Chugh (Head of Hadoop Product Management at Pivotal Inc.)"] |
| 1 | ["Patrick Wendell"] |
| 2 | ["Michael Armbrust","Zongheng Yang"] |
| 1 | ["Michael Hiskey (VP at MicroStrategy Inc.)"] |
| 1 | ["Christopher Nguyen (CEO & Co-Founder of Adatao)"] |
| 1 | ["Databricks Press Office"] |
| 1 | ["Dean Wampler (Typesafe)"] |
| 1 | ["Hari Kodakalla (EVP at Apervi Inc.)"] |
| 1 | ["Bill Kehoe (Big Data Architect at Qlik)"] |
| 1 | ["Databricks Press Office"] |
| 1 | ["Costin Leau (Engineer at Elasticsearch)"] |
| 1 | ["Jake Cornelius (SVP of Product Management at Pentaho)"] |
| 1 | ["SriSatish Ambati (CEO of 0xData)"] |
| 1 | ["Databricks Press Office"] |
| 1 | ["Arsalan Tavakoli-Shiraji"] |
| 1 | ["Arsalan Tavakoli-Shiraji"] |
| 1 | ["Databricks Press Office"] |
| 1 | ["Databricks Press Office"] |
| 1 | ["Arsalan Tavakoli-Shiraji"] |
| 1 | ["Reynold Xin"] |
| 1 | ["Ion Stoica"] |
| 1 | ["Xiangrui Meng"] |
| 1 | ["Matei Zaharia"] |
| 3 | ["Burak Yavuz","Xiangrui Meng","Reynold Xin"] |
| 2 | ["Li Pu","Reza Zadeh"] |
| 1 | ["Scott Walent"] |
| 1 | ["Oscar Mendez (CEO of Stratio)"] |
| 2 | ["Andy Huang (Alibaba Taobao Data Mining Team)","Wei Wu (Alibaba Taobao Data Mining Team)"] |
| 4 | ["Doris Xin","Burak Yavuz","Xiangrui Meng","Hossein Falaki"] |
| 1 | ["Patrick Wendell"] |
| 3 | ["Arsalan Tavakoli-Shiraji","Tathagata Das","Patrick Wendell"] |
| 2 | ["Burak Yavuz","Xiangrui Meng"] |
| 1 | ["Gavin Targonski (Product Management at Talend)"] |
| 2 | ["Nick Pentreath (Graphflow)","Kan Zhang (IBM)"] |
| 1 | ["Vida Ha"] |
| 2 | ["John Tripier","Paco Nathan"] |
| 1 | ["Christopher Burdorf (Senior Software Engineer at NBC Universal)"] |
| 2 | ["Manish Amde (Origami Logic)","Joseph Bradley (Databricks)"] |
| 1 | ["Eric Carr (VP Core Systems Group at Guavus)"] |
| 1 | ["Jeremy Freeman (Freeman Lab)"] |
| 1 | ["Russell Cardullo (Sharethrough)"] |
| 1 | ["Sean Kandel (CTO at Trifacta)"] |
| 1 | ["Reynold Xin"] |
| 1 | ["Reza Zadeh"] |
| 1 | ["Jeff Feng (Product Manager at Tableau Software)"] |
| 1 | ["Scott Walent"] |
| 2 | ["Ari Himmel (CEO at Faimdata)","Nan Zhu (Chief Architect at Faimdata)"] |
| 1 | ["John Kreisa (VP of Strategic Marketing at Hortonworks)"] |
| 1 | ["Sachin Chawla (VP of Engineering)"] |
| 1 | ["Sonal Goyal (CEO)"] |
| 1 | [" Dibyendu Bhattacharya (Big Data Architect)"] |
| 1 | ["Reynold Xin"] |
| 1 | ["Matt MacKinnon (Director of Product Management at Zaloni)"] |
| 2 | ["John Tripier","Paco Nathan"] |
| 3 | ["Luis Quintela (Sr. Manager of Big Data Analytics)","Yan Breek (Data Scientist)","Girish Kathalagiri (Data Analytics Engineer)"] |
| 2 | ["Ameet Talwalkar","Anthony Joseph"] |
| 1 | ["Lieven Gesquiere (Virdata Lead Core R&D)"] |
| 1 | ["by Databricks Press Office"] |
| 1 | ["Kavitha Mariappan"] |
| 1 | ["Kavitha Mariappan"] |
| 1 | ["Yin Huai (Databricks)"] |
| 1 | ["Jeremy Freeman (Howard Hughes Medical Institute)"] |
| 1 | ["Dave Wang (Databricks)"] |
| 1 | ["Patrick Wendell"] |
| 2 | ["Xiangrui Meng","Patrick Wendell"] |
| 4 | ["Xiangrui Meng","Joseph Bradley","Evan Sparks (UC Berkeley)","Shivaram Venkataraman (UC Berkeley)"] |
| 1 | ["Michael Armbrust"] |
| 2 | ["Joseph K. Bradley (Databricks)","Manish Amde (Origami Logic)"] |
| 1 | ["Tathagata Das"] |
| 1 | ["Kavitha Mariappan"] |
| 4 | ["Holden Karau","Andy Konwinski","Patrick Wendell","Matei Zaharia"] |
| -1 | null |
| -1 | null |
| -1 | null |
| -1 | null |
| -1 | null |
| -1 | null |
| size(authors) | authors |
|---|
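Note the `-1` values in the output above: when the array column itself is null, `size(..)` returns -1 rather than a count. A minimal pure-Python sketch of that behavior (an analogy, not Spark code):

```python
# Sketch of Spark SQL's size(...) semantics for array columns:
# the array's length, or -1 when the value is null,
# matching the -1 rows shown for null arrays above.
def array_size(arr):
    return -1 if arr is None else len(arr)

print(array_size(["Michael Armbrust", "Reynold Xin"]))  # 2
print(array_size(None))                                 # -1
```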
%md Pull the first element from the array `authors` using an array subscript operator. For example, in Scala, the 0th element of array `authors` is `authors(0)` whereas, in Python, the 0th element of `authors` is `authors[0]`.
from pyspark.sql.functions import col

display(databricksBlogDF.select(col("authors")[0].alias("primaryAuthor")))
| Tomer Shiran (VP of Product Management at MapR) |
| Tathagata Das |
| Steven Hillion |
| Michael Armbrust |
| Patrick Wendell |
| Ali Ghodsi |
| Russell Cardullo (Data Infrastructure Engineer at Sharethrough) |
| Jai Ranganathan |
| Databricks Press Office |
| Ion Stoica |
| Ahir Reddy |
| Pat McDonough |
| Ion Stoica |
| Patrick Wendell |
| Andy Konwinski |
| Pat McDonough |
| Ion Stoica |
| Matei Zaharia |
| Ion Stoica |
| Arsalan Tavakoli-Shiraji |
| Prashant Sharma |
| Databricks Training Team |
| Claudiu Barbura (Sr. Dir. of Engineering at Atigeo LLC) |
| Sarabjeet Chugh (Head of Hadoop Product Management at Pivotal Inc.) |
| Patrick Wendell |
| Michael Armbrust |
| Michael Hiskey (VP at MicroStrategy Inc.) |
| Christopher Nguyen (CEO & Co-Founder of Adatao) |
| Databricks Press Office |
| Dean Wampler (Typesafe) |
| Hari Kodakalla (EVP at Apervi Inc.) |
| Bill Kehoe (Big Data Architect at Qlik) |
| Databricks Press Office |
| Costin Leau (Engineer at Elasticsearch) |
| Jake Cornelius (SVP of Product Management at Pentaho) |
| SriSatish Ambati (CEO of 0xData) |
| Databricks Press Office |
| Arsalan Tavakoli-Shiraji |
| Arsalan Tavakoli-Shiraji |
| Databricks Press Office |
| Databricks Press Office |
| Arsalan Tavakoli-Shiraji |
| Reynold Xin |
| Ion Stoica |
| Xiangrui Meng |
| Matei Zaharia |
| Burak Yavuz |
| Li Pu |
| Scott Walent |
| Oscar Mendez (CEO of Stratio) |
| Andy Huang (Alibaba Taobao Data Mining Team) |
| Doris Xin |
| Patrick Wendell |
| Arsalan Tavakoli-Shiraji |
| Burak Yavuz |
| Gavin Targonski (Product Management at Talend) |
| Nick Pentreath (Graphflow) |
| Vida Ha |
| John Tripier |
| Christopher Burdorf (Senior Software Engineer at NBC Universal) |
| Manish Amde (Origami Logic) |
| Eric Carr (VP Core Systems Group at Guavus) |
| Jeremy Freeman (Freeman Lab) |
| Russell Cardullo (Sharethrough) |
| Sean Kandel (CTO at Trifacta) |
| Reynold Xin |
| Reza Zadeh |
| Jeff Feng (Product Manager at Tableau Software) |
| Scott Walent |
| Ari Himmel (CEO at Faimdata) |
| John Kreisa (VP of Strategic Marketing at Hortonworks) |
| Sachin Chawla (VP of Engineering) |
| Sonal Goyal (CEO) |
| Dibyendu Bhattacharya (Big Data Architect) |
| Reynold Xin |
| Matt MacKinnon (Director of Product Management at Zaloni) |
| John Tripier |
| Luis Quintela (Sr. Manager of Big Data Analytics) |
| Ameet Talwalkar |
| Lieven Gesquiere (Virdata Lead Core R&D) |
| by Databricks Press Office |
| Kavitha Mariappan |
| Kavitha Mariappan |
| Yin Huai (Databricks) |
| Jeremy Freeman (Howard Hughes Medical Institute) |
| Dave Wang (Databricks) |
| Patrick Wendell |
| Xiangrui Meng |
| Xiangrui Meng |
| Michael Armbrust |
| Joseph K. Bradley (Databricks) |
| Tathagata Das |
| Kavitha Mariappan |
| Holden Karau |
| null |
| null |
| null |
| null |
| null |
| null |
| primaryAuthor |
|---|
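The subscript is null-safe: for rows where `authors` is null, `authors[0]` yields null rather than an error, as the null rows in the output above show. A pure-Python sketch of that semantics (an analogy, not Spark code):

```python
# Sketch of the null-safe subscript authors[0]: return the first
# element, or None when the array itself is null, matching the
# null rows in the output above.
def first_author(arr):
    return None if arr is None else arr[0]

print(first_author(["Ion Stoica", "Matei Zaharia"]))  # Ion Stoica
print(first_author(None))                             # None
```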
%md ### Explode The `explode` method allows you to split an array column into multiple rows, copying all the other columns into each new row. For example, split the column `authors` into the column `author`, with one author per row.
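Conceptually, `explode` emits one output row per array element, copying the other columns into each new row. A minimal pure-Python sketch of that behavior (an analogy, not Spark code), using a row taken from the table below:

```python
# Sketch of explode: one output row per array element, with the
# other columns copied into each new row; rows whose array is null
# or empty produce no output rows.
def explode_rows(rows, array_key, out_key):
    for row in rows:
        for element in (row[array_key] or []):
            new_row = dict(row)
            new_row[out_key] = element
            yield new_row

posts = [{"title": "Apache Spark In MapReduce (SIMR)",
          "authors": ["Ali Ghodsi", "Ahir Reddy"]}]

for r in explode_rows(posts, "authors", "author"):
    print(r["title"], "-", r["author"])
# Apache Spark In MapReduce (SIMR) - Ali Ghodsi
# Apache Spark In MapReduce (SIMR) - Ahir Reddy
```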
from pyspark.sql.functions import explode, col

display(databricksBlogDF.select("title", "authors", explode(col("authors")).alias("author"), "link"))
| MapR Integrates the Complete Apache Spark Stack | ["Tomer Shiran (VP of Product Management at MapR)"] | Tomer Shiran (VP of Product Management at MapR) | https://databricks.com/blog/2014/04/10/mapr-integrates-spark-stack.html |
| Apache Spark 0.9.1 Released | ["Tathagata Das"] | Tathagata Das | https://databricks.com/blog/2014/04/09/spark-0_9_1-released.html |
| Application Spotlight: Alpine Data Labs | ["Steven Hillion"] | Steven Hillion | https://databricks.com/blog/2014/03/31/application-spotlight-alpine.html |
| Spark SQL: Manipulating Structured Data Using Apache Spark | ["Michael Armbrust","Reynold Xin"] | Michael Armbrust | https://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html |
| Spark SQL: Manipulating Structured Data Using Apache Spark | ["Michael Armbrust","Reynold Xin"] | Reynold Xin | https://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html |
| Apache Spark 0.9.0 Released | ["Patrick Wendell"] | Patrick Wendell | https://databricks.com/blog/2014/02/03/release-0_9_0.html |
| Apache Spark In MapReduce (SIMR) | ["Ali Ghodsi","Ahir Reddy"] | Ali Ghodsi | https://databricks.com/blog/2014/01/01/simr.html |
| Apache Spark In MapReduce (SIMR) | ["Ali Ghodsi","Ahir Reddy"] | Ahir Reddy | https://databricks.com/blog/2014/01/01/simr.html |
| Sharethrough Uses Apache Spark Streaming to Optimize Bidding in Real Time | ["Russell Cardullo (Data Infrastructure Engineer at Sharethrough)","Michael Ruggiero (Data Infrastructure Engineer at Sharethrough)"] | Russell Cardullo (Data Infrastructure Engineer at Sharethrough) | https://databricks.com/blog/2014/03/25/sharethrough-and-spark-streaming.html |
| Sharethrough Uses Apache Spark Streaming to Optimize Bidding in Real Time | ["Russell Cardullo (Data Infrastructure Engineer at Sharethrough)","Michael Ruggiero (Data Infrastructure Engineer at Sharethrough)"] | Michael Ruggiero (Data Infrastructure Engineer at Sharethrough) | https://databricks.com/blog/2014/03/25/sharethrough-and-spark-streaming.html |
| Apache Spark: A Delight for Developers | ["Jai Ranganathan","Matei Zaharia"] | Jai Ranganathan | https://databricks.com/blog/2014/03/20/apache-spark-a-delight-for-developers.html |
| Apache Spark: A Delight for Developers | ["Jai Ranganathan","Matei Zaharia"] | Matei Zaharia | https://databricks.com/blog/2014/03/20/apache-spark-a-delight-for-developers.html |
| Databricks announces "Certified on Apache Spark" Program | ["Databricks Press Office"] | Databricks Press Office | https://databricks.com/blog/2014/03/18/spark-certification.html |
| Apache Spark Now a Top-level Apache Project | ["Ion Stoica"] | Ion Stoica | https://databricks.com/blog/2014/03/02/spark-apache-top-level-project.html |
| AMPLab updates the Big Data Benchmark | ["Ahir Reddy","Reynold Xin"] | Ahir Reddy | https://databricks.com/blog/2014/02/12/big-data-benchmark.html |
| AMPLab updates the Big Data Benchmark | ["Ahir Reddy","Reynold Xin"] | Reynold Xin | https://databricks.com/blog/2014/02/12/big-data-benchmark.html |
| Databricks at the O'Reilly Strata Conference 2014 | ["Pat McDonough"] | Pat McDonough | https://databricks.com/blog/2014/02/10/strata-santa-clara-2014.html |
| Apache Spark and Hadoop: Working Together | ["Ion Stoica"] | Ion Stoica | https://databricks.com/blog/2014/01/21/spark-and-hadoop.html |
| Apache Spark 0.8.1 Released | ["Patrick Wendell"] | Patrick Wendell | https://databricks.com/blog/2013/12/19/release-0_8_1.html |
| Highlights From Spark Summit 2013 | ["Andy Konwinski"] | Andy Konwinski | https://databricks.com/blog/2013/12/18/spark-summit-2013-follow-up.html |
| Putting Apache Spark to Use: Fast In-Memory Computing for Your Big Data Applications | ["Pat McDonough"] | Pat McDonough | https://databricks.com/blog/2013/11/21/putting-spark-to-use.html |
| Databricks and Cloudera Partner to Support Apache Spark | ["Ion Stoica"] | Ion Stoica | https://databricks.com/blog/2013/10/28/databricks-and-cloudera-partner-to-support-spark.html |
| The Growing Apache Spark Community | ["Matei Zaharia"] | Matei Zaharia | https://databricks.com/blog/2013/10/27/the-growing-spark-community.html |
| Databricks and the Apache Spark Platform | ["Ion Stoica","Matei Zaharia"] | Ion Stoica | https://databricks.com/blog/2013/10/27/databricks-and-the-apache-spark-platform.html |
| Databricks and the Apache Spark Platform | ["Ion Stoica","Matei Zaharia"] | Matei Zaharia | https://databricks.com/blog/2013/10/27/databricks-and-the-apache-spark-platform.html |
| Databricks and MapR | ["Arsalan Tavakoli-Shiraji"] | Arsalan Tavakoli-Shiraji | https://databricks.com/blog/2014/04/10/partnership-between-databricks-and-mapr.html |
| Making Apache Spark Easier to Use in Java with Java 8 | ["Prashant Sharma","Matei Zaharia"] | Prashant Sharma | https://databricks.com/blog/2014/04/14/spark-with-java-8.html |
| Making Apache Spark Easier to Use in Java with Java 8 | ["Prashant Sharma","Matei Zaharia"] | Matei Zaharia | https://databricks.com/blog/2014/04/14/spark-with-java-8.html |
| Databricks Announces Apache Spark Training Workshops | ["Databricks Training Team"] | Databricks Training Team | https://databricks.com/blog/2014/06/02/databricks-hands-on-technical-workshops.html |
| Application Spotlight: Atigeo xPatterns | ["Claudiu Barbura (Sr. Dir. of Engineering at Atigeo LLC)"] | Claudiu Barbura (Sr. Dir. of Engineering at Atigeo LLC) | https://databricks.com/blog/2014/05/22/application-spotlight-atigeo-xpatterns.html |
| Pivotal Hadoop Integrates the Full Apache Spark Stack | ["Sarabjeet Chugh (Head of Hadoop Product Management at Pivotal Inc.)"] | Sarabjeet Chugh (Head of Hadoop Product Management at Pivotal Inc.) | https://databricks.com/blog/2014/05/23/pivotal-hadoop-integrates-the-full-apache-spark-stack.html |
| Announcing Apache Spark 1.0 | ["Patrick Wendell"] | Patrick Wendell | https://databricks.com/blog/2014/05/30/announcing-spark-1-0.html |
| Exciting Performance Improvements on the Horizon for Spark SQL | ["Michael Armbrust","Zongheng Yang"] | Michael Armbrust | https://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html |
| Exciting Performance Improvements on the Horizon for Spark SQL | ["Michael Armbrust","Zongheng Yang"] | Zongheng Yang | https://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html |
| MicroStrategy "Certified on Apache Spark" | ["Michael Hiskey (VP at MicroStrategy Inc.)"] | Michael Hiskey (VP at MicroStrategy Inc.) | https://databricks.com/blog/2014/06/04/microstrategy-certified-on-spark.html |
| Application Spotlight: Arimo | ["Christopher Nguyen (CEO & Co-Founder of Adatao)"] | Christopher Nguyen (CEO & Co-Founder of Adatao) | https://databricks.com/blog/2014/06/11/application-spotlight-arimo.html |
| Spark Summit 2014 Brings Together Apache Spark Community | ["Databricks Press Office"] | Databricks Press Office | https://databricks.com/blog/2014/06/11/spark-summit-2014-brings-together-apache-spark-community.html |
| Application Spotlight: Lightbend | ["Dean Wampler (Typesafe)"] | Dean Wampler (Typesafe) | https://databricks.com/blog/2014/06/13/application-spotlight-lightbend.html |
| Application Spotlight: Apervi | ["Hari Kodakalla (EVP at Apervi Inc.)"] | Hari Kodakalla (EVP at Apervi Inc.) | https://databricks.com/blog/2014/06/23/application-spotlight-apervi.html |
| Application Spotlight: Qlik | ["Bill Kehoe (Big Data Architect at Qlik)"] | Bill Kehoe (Big Data Architect at Qlik) | https://databricks.com/blog/2014/06/24/application-spotlight-qlik.html |
| Databricks Launches "Certified Apache Spark Distribution" Program | ["Databricks Press Office"] | Databricks Press Office | https://databricks.com/blog/2014/06/26/databricks-launches-certified-spark-distribution-program.html |
| Application Spotlight: Elasticsearch | ["Costin Leau (Engineer at Elasticsearch)"] | Costin Leau (Engineer at Elasticsearch) | https://databricks.com/blog/2014/06/27/application-spotlight-elasticsearch.html |
| Application Spotlight: Pentaho | ["Jake Cornelius (SVP of Product Management at Pentaho)"] | Jake Cornelius (SVP of Product Management at Pentaho) | https://databricks.com/blog/2014/06/30/application-spotlight-pentaho.html |
| Sparkling Water = H20 + Apache Spark | ["SriSatish Ambati (CEO of 0xData)"] | SriSatish Ambati (CEO of 0xData) | https://databricks.com/blog/2014/06/30/sparkling-water-h20-spark.html |
| Databricks Unveils Apache Spark-Based Cloud Platform; Announces Series B Funding | ["Databricks Press Office"] | Databricks Press Office | https://databricks.com/blog/2014/06/30/databricks-unveils-spark-based-cloud-platform.html |
| Databricks Application Spotlight at Spark Summit 2014 | ["Arsalan Tavakoli-Shiraji"] | Arsalan Tavakoli-Shiraji | https://databricks.com/blog/2014/04/28/databricks-application-spotlight-at-spark-summit-2014.html |
| Databricks and Datastax | ["Arsalan Tavakoli-Shiraji"] | Arsalan Tavakoli-Shiraji | https://databricks.com/blog/2014/05/08/databricks-and-datastax.html |
| Databricks Partners with Simba to Deliver Shark ODBC Driver | ["Databricks Press Office"] | Databricks Press Office | https://databricks.com/blog/2014/04/30/databricks-partners-with-simba-to-deliver-shark-odbc-driver.html |
| Databricks Announces Partnership with SAP | ["Databricks Press Office"] | Databricks Press Office | https://databricks.com/blog/2014/07/01/databricks-announces-partnership-with-sap.html |
| Integrating Apache Spark and HANA | ["Arsalan Tavakoli-Shiraji"] | Arsalan Tavakoli-Shiraji | https://databricks.com/blog/2014/07/01/integrating-spark-and-hana.html |
| Shark, Spark SQL, Hive on Spark, and the future of SQL on Apache Spark | ["Reynold Xin"] | Reynold Xin | https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html |
| Databricks: Making Big Data Easy | ["Ion Stoica"] | Ion Stoica | https://databricks.com/blog/2014/07/14/databricks-cloud-making-big-data-easy.html |
| New Features in MLlib in Apache Spark 1.0 | ["Xiangrui Meng"] | Xiangrui Meng | https://databricks.com/blog/2014/07/16/new-features-in-mllib-in-spark-1-0.html |
| The State of Apache Spark in 2014 | ["Matei Zaharia"] | Matei Zaharia | https://databricks.com/blog/2014/07/18/the-state-of-apache-spark-in-2014.html |
| Scalable Collaborative Filtering with Apache Spark MLlib | ["Burak Yavuz","Xiangrui Meng","Reynold Xin"] | Burak Yavuz | https://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html |
| Scalable Collaborative Filtering with Apache Spark MLlib | ["Burak Yavuz","Xiangrui Meng","Reynold Xin"] | Xiangrui Meng | https://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html |
| Scalable Collaborative Filtering with Apache Spark MLlib | ["Burak Yavuz","Xiangrui Meng","Reynold Xin"] | Reynold Xin | https://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html |
| Distributing the Singular Value Decomposition with Apache Spark | ["Li Pu","Reza Zadeh"] | Li Pu | https://databricks.com/blog/2014/07/21/distributing-the-singular-value-decomposition-with-spark.html |
| Distributing the Singular Value Decomposition with Apache Spark | ["Li Pu","Reza Zadeh"] | Reza Zadeh | https://databricks.com/blog/2014/07/21/distributing-the-singular-value-decomposition-with-spark.html |
| Spark Summit 2014 Highlights | ["Scott Walent"] | Scott Walent | https://databricks.com/blog/2014/07/22/spark-summit-2014-highlights.html |
| When Stratio Met Apache Spark: A True Love Story | ["Oscar Mendez (CEO of Stratio)"] | Oscar Mendez (CEO of Stratio) | https://databricks.com/blog/2014/08/08/when-stratio-met-spark-a-true-love-story.html |
| Mining Ecommerce Graph Data with Apache Spark at Alibaba Taobao | ["Andy Huang (Alibaba Taobao Data Mining Team)","Wei Wu (Alibaba Taobao Data Mining Team)"] | Andy Huang (Alibaba Taobao Data Mining Team) | https://databricks.com/blog/2014/08/14/mining-graph-data-with-spark-at-alibaba-taobao.html |
| Mining Ecommerce Graph Data with Apache Spark at Alibaba Taobao | ["Andy Huang (Alibaba Taobao Data Mining Team)","Wei Wu (Alibaba Taobao Data Mining Team)"] | Wei Wu (Alibaba Taobao Data Mining Team) | https://databricks.com/blog/2014/08/14/mining-graph-data-with-spark-at-alibaba-taobao.html |
| Statistics Functionality in Apache Spark 1.1 | ["Doris Xin","Burak Yavuz","Xiangrui Meng","Hossein Falaki"] | Doris Xin | https://databricks.com/blog/2014/08/27/statistics-functionality-in-spark.html |
| Statistics Functionality in Apache Spark 1.1 | ["Doris Xin","Burak Yavuz","Xiangrui Meng","Hossein Falaki"] | Burak Yavuz | https://databricks.com/blog/2014/08/27/statistics-functionality-in-spark.html |
| Statistics Functionality in Apache Spark 1.1 | ["Doris Xin","Burak Yavuz","Xiangrui Meng","Hossein Falaki"] | Xiangrui Meng | https://databricks.com/blog/2014/08/27/statistics-functionality-in-spark.html |
| Statistics Functionality in Apache Spark 1.1 | ["Doris Xin","Burak Yavuz","Xiangrui Meng","Hossein Falaki"] | Hossein Falaki | https://databricks.com/blog/2014/08/27/statistics-functionality-in-spark.html |
| Announcing Apache Spark 1.1 | ["Patrick Wendell"] | Patrick Wendell | https://databricks.com/blog/2014/09/11/announcing-spark-1-1.html |
| Apache Spark 1.1: The State of Spark Streaming | ["Arsalan Tavakoli-Shiraji","Tathagata Das","Patrick Wendell"] | Arsalan Tavakoli-Shiraji | https://databricks.com/blog/2014/09/16/spark-1-1-the-state-of-spark-streaming.html |
| Apache Spark 1.1: The State of Spark Streaming | ["Arsalan Tavakoli-Shiraji","Tathagata Das","Patrick Wendell"] | Tathagata Das | https://databricks.com/blog/2014/09/16/spark-1-1-the-state-of-spark-streaming.html |
| Apache Spark 1.1: The State of Spark Streaming | ["Arsalan Tavakoli-Shiraji","Tathagata Das","Patrick Wendell"] | Patrick Wendell | https://databricks.com/blog/2014/09/16/spark-1-1-the-state-of-spark-streaming.html |
| Apache Spark 1.1: MLlib Performance Improvements | ["Burak Yavuz","Xiangrui Meng"] | Burak Yavuz | https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html |
| Apache Spark 1.1: MLlib Performance Improvements | ["Burak Yavuz","Xiangrui Meng"] | Xiangrui Meng | https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html |
| Application Spotlight: Talend | ["Gavin Targonski (Product Management at Talend)"] | Gavin Targonski (Product Management at Talend) | https://databricks.com/blog/2014/09/15/application-spotlight-talend.html |
| Apache Spark 1.1: Bringing Hadoop Input/Output Formats to PySpark | ["Nick Pentreath (Graphflow)","Kan Zhang (IBM)"] | Nick Pentreath (Graphflow) | https://databricks.com/blog/2014/09/17/spark-1-1-bringing-hadoop-inputoutput-formats-to-pyspark.html |
| Apache Spark 1.1: Bringing Hadoop Input/Output Formats to PySpark | ["Nick Pentreath (Graphflow)","Kan Zhang (IBM)"] | Kan Zhang (IBM) | https://databricks.com/blog/2014/09/17/spark-1-1-bringing-hadoop-inputoutput-formats-to-pyspark.html |
| Databricks Reference Applications | ["Vida Ha"] | Vida Ha | https://databricks.com/blog/2014/09/23/databricks-reference-applications.html |
| Databricks and O'Reilly Media launch Certification Program for Apache Spark Developers | ["John Tripier","Paco Nathan"] | John Tripier | https://databricks.com/blog/2014/09/18/databricks-and-oreilly-media-launch-certification-program-for-apache-spark-developers.html |
| Databricks and O'Reilly Media launch Certification Program for Apache Spark Developers | ["John Tripier","Paco Nathan"] | Paco Nathan | https://databricks.com/blog/2014/09/18/databricks-and-oreilly-media-launch-certification-program-for-apache-spark-developers.html |
| Apache Spark Improves the Economics of Video Distribution at NBC Universal | ["Christopher Burdorf (Senior Software Engineer at NBC Universal)"] | Christopher Burdorf (Senior Software Engineer at NBC Universal) | https://databricks.com/blog/2014/09/24/apache-spark-improves-the-economics-of-video-distribution-at-nbc-universal.html |
| Scalable Decision Trees in MLlib | ["Manish Amde (Origami Logic)","Joseph Bradley (Databricks)"] | Manish Amde (Origami Logic) | https://databricks.com/blog/2014/09/29/scalable-decision-trees-in-mllib.html |
| Scalable Decision Trees in MLlib | ["Manish Amde (Origami Logic)","Joseph Bradley (Databricks)"] | Joseph Bradley (Databricks) | https://databricks.com/blog/2014/09/29/scalable-decision-trees-in-mllib.html |
| Guavus Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World’s Largest Telcos | ["Eric Carr (VP Core Systems Group at Guavus)"] | Eric Carr (VP Core Systems Group at Guavus) | https://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html |
| Apache Spark as a platform for large-scale neuroscience | ["Jeremy Freeman (Freeman Lab)"] | Jeremy Freeman (Freeman Lab) | https://databricks.com/blog/2014/10/01/spark-as-a-platform-for-large-scale-neuroscience.html |
| Sharethrough Uses Apache Spark Streaming to Optimize Advertisers' Return on Marketing Investment | ["Russell Cardullo (Sharethrough)"] | Russell Cardullo (Sharethrough) | https://databricks.com/blog/2014/10/07/sharethrough-uses-spark-streaming-to-optimize-advertisers-return-on-marketing-investment.html |
| Application Spotlight: Trifacta | ["Sean Kandel (CTO at Trifacta)"] | Sean Kandel (CTO at Trifacta) | https://databricks.com/blog/2014/10/09/application-spotlight-trifacta.html |
| Apache Spark the fastest open source engine for sorting a petabyte | ["Reynold Xin"] | Reynold Xin | https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html |
| Efficient similarity algorithm now in Apache Spark, thanks to Twitter | ["Reza Zadeh"] | Reza Zadeh | https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html |
| Application Spotlight: Tableau Software | ["Jeff Feng (Product Manager at Tableau Software)"] | Jeff Feng (Product Manager at Tableau Software) | https://databricks.com/blog/2014/10/15/application-spotlight-tableau-software.html |
| Spark Summit East - CFP now open | ["Scott Walent"] | Scott Walent | https://databricks.com/blog/2014/10/23/spark-summit-east-cfp-now-open.html |
| Application Spotlight: Faimdata | ["Ari Himmel (CEO at Faimdata)","Nan Zhu (Chief Architect at Faimdata)"] | Ari Himmel (CEO at Faimdata) | https://databricks.com/blog/2014/10/27/application-spotlight-faimdata.html |
| Application Spotlight: Faimdata | ["Ari Himmel (CEO at Faimdata)","Nan Zhu (Chief Architect at Faimdata)"] | Nan Zhu (Chief Architect at Faimdata) | https://databricks.com/blog/2014/10/27/application-spotlight-faimdata.html |
| Hortonworks: A shared vision for Apache Spark on Hadoop | ["John Kreisa (VP of Strategic Marketing at Hortonworks)"] | John Kreisa (VP of Strategic Marketing at Hortonworks) | https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html |
| Application Spotlight: Skytree Infinity | ["Sachin Chawla (VP of Engineering)"] | Sachin Chawla (VP of Engineering) | https://databricks.com/blog/2014/11/24/application-spotlight-skytree-infinity.html |
| Application Spotlight: Nube Reifier | ["Sonal Goyal (CEO)"] | Sonal Goyal (CEO) | https://databricks.com/blog/2014/12/02/application-spotlight-nube-reifier.html |
| Pearson uses Apache Spark Streaming for next generation adaptive learning platform | [" Dibyendu Bhattacharya (Big Data Architect)"] | Dibyendu Bhattacharya (Big Data Architect) | https://databricks.com/blog/2014/12/08/pearson-uses-spark-streaming-for-next-generation-adaptive-learning-platform.html |
| Apache Spark officially sets a new record in large-scale sorting | ["Reynold Xin"] | Reynold Xin | https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html |
| Application Spotlight: Bedrock | ["Matt MacKinnon (Director of Product Management at Zaloni)"] | Matt MacKinnon (Director of Product Management at Zaloni) | https://databricks.com/blog/2014/11/14/application-spotlight-bedrock.html |
| The Apache Spark Certified Developer Program | ["John Tripier","Paco Nathan"] | John Tripier | https://databricks.com/blog/2014/11/14/the-spark-certified-developer-program.html |
| The Apache Spark Certified Developer Program | ["John Tripier","Paco Nathan"] | Paco Nathan | https://databricks.com/blog/2014/11/14/the-spark-certified-developer-program.html |
| title | authors | author | link |
|---|---|---|---|
databricksBlog2DF = (databricksBlogDF
  .select("title", "authors", explode(col("authors")).alias("author"), "link")
  .filter(size(col("authors")) > 1)
  .orderBy("title")
)

display(databricksBlog2DF)
| title | authors | author | link |
|---|---|---|---|
| "Learning Spark" book available from O'Reilly | ["Holden Karau","Andy Konwinski","Patrick Wendell","Matei Zaharia"] | Matei Zaharia | https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html |
| "Learning Spark" book available from O'Reilly | ["Holden Karau","Andy Konwinski","Patrick Wendell","Matei Zaharia"] | Holden Karau | https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html |
| "Learning Spark" book available from O'Reilly | ["Holden Karau","Andy Konwinski","Patrick Wendell","Matei Zaharia"] | Andy Konwinski | https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html |
| "Learning Spark" book available from O'Reilly | ["Holden Karau","Andy Konwinski","Patrick Wendell","Matei Zaharia"] | Patrick Wendell | https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html |
| AMPLab updates the Big Data Benchmark | ["Ahir Reddy","Reynold Xin"] | Ahir Reddy | https://databricks.com/blog/2014/02/12/big-data-benchmark.html |
| AMPLab updates the Big Data Benchmark | ["Ahir Reddy","Reynold Xin"] | Reynold Xin | https://databricks.com/blog/2014/02/12/big-data-benchmark.html |
| Announcing Apache Spark Packages | ["Xiangrui Meng","Patrick Wendell"] | Patrick Wendell | https://databricks.com/blog/2014/12/22/announcing-spark-packages.html |
| Announcing Apache Spark Packages | ["Xiangrui Meng","Patrick Wendell"] | Xiangrui Meng | https://databricks.com/blog/2014/12/22/announcing-spark-packages.html |
| Apache Spark 1.1: Bringing Hadoop Input/Output Formats to PySpark | ["Nick Pentreath (Graphflow)","Kan Zhang (IBM)"] | Nick Pentreath (Graphflow) | https://databricks.com/blog/2014/09/17/spark-1-1-bringing-hadoop-inputoutput-formats-to-pyspark.html |
| Apache Spark 1.1: Bringing Hadoop Input/Output Formats to PySpark | ["Nick Pentreath (Graphflow)","Kan Zhang (IBM)"] | Kan Zhang (IBM) | https://databricks.com/blog/2014/09/17/spark-1-1-bringing-hadoop-inputoutput-formats-to-pyspark.html |
| Apache Spark 1.1: MLlib Performance Improvements | ["Burak Yavuz","Xiangrui Meng"] | Burak Yavuz | https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html |
| Apache Spark 1.1: MLlib Performance Improvements | ["Burak Yavuz","Xiangrui Meng"] | Xiangrui Meng | https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html |
| Apache Spark 1.1: The State of Spark Streaming | ["Arsalan Tavakoli-Shiraji","Tathagata Das","Patrick Wendell"] | Arsalan Tavakoli-Shiraji | https://databricks.com/blog/2014/09/16/spark-1-1-the-state-of-spark-streaming.html |
| Apache Spark 1.1: The State of Spark Streaming | ["Arsalan Tavakoli-Shiraji","Tathagata Das","Patrick Wendell"] | Tathagata Das | https://databricks.com/blog/2014/09/16/spark-1-1-the-state-of-spark-streaming.html |
| Apache Spark 1.1: The State of Spark Streaming | ["Arsalan Tavakoli-Shiraji","Tathagata Das","Patrick Wendell"] | Patrick Wendell | https://databricks.com/blog/2014/09/16/spark-1-1-the-state-of-spark-streaming.html |
| Apache Spark In MapReduce (SIMR) | ["Ali Ghodsi","Ahir Reddy"] | Ali Ghodsi | https://databricks.com/blog/2014/01/01/simr.html |
| Apache Spark In MapReduce (SIMR) | ["Ali Ghodsi","Ahir Reddy"] | Ahir Reddy | https://databricks.com/blog/2014/01/01/simr.html |
| Apache Spark: A Delight for Developers | ["Jai Ranganathan","Matei Zaharia"] | Matei Zaharia | https://databricks.com/blog/2014/03/20/apache-spark-a-delight-for-developers.html |
| Apache Spark: A Delight for Developers | ["Jai Ranganathan","Matei Zaharia"] | Jai Ranganathan | https://databricks.com/blog/2014/03/20/apache-spark-a-delight-for-developers.html |
| Application Spotlight: Faimdata | ["Ari Himmel (CEO at Faimdata)","Nan Zhu (Chief Architect at Faimdata)"] | Ari Himmel (CEO at Faimdata) | https://databricks.com/blog/2014/10/27/application-spotlight-faimdata.html |
| Application Spotlight: Faimdata | ["Ari Himmel (CEO at Faimdata)","Nan Zhu (Chief Architect at Faimdata)"] | Nan Zhu (Chief Architect at Faimdata) | https://databricks.com/blog/2014/10/27/application-spotlight-faimdata.html |
| Databricks and O'Reilly Media launch Certification Program for Apache Spark Developers | ["John Tripier","Paco Nathan"] | Paco Nathan | https://databricks.com/blog/2014/09/18/databricks-and-oreilly-media-launch-certification-program-for-apache-spark-developers.html |
| Databricks and O'Reilly Media launch Certification Program for Apache Spark Developers | ["John Tripier","Paco Nathan"] | John Tripier | https://databricks.com/blog/2014/09/18/databricks-and-oreilly-media-launch-certification-program-for-apache-spark-developers.html |
| Databricks and the Apache Spark Platform | ["Ion Stoica","Matei Zaharia"] | Ion Stoica | https://databricks.com/blog/2013/10/27/databricks-and-the-apache-spark-platform.html |
| Databricks and the Apache Spark Platform | ["Ion Stoica","Matei Zaharia"] | Matei Zaharia | https://databricks.com/blog/2013/10/27/databricks-and-the-apache-spark-platform.html |
| Databricks to run two massive online courses on Apache Spark | ["Ameet Talwalkar","Anthony Joseph"] | Ameet Talwalkar | https://databricks.com/blog/2014/12/02/announcing-two-spark-based-moocs.html |
| Databricks to run two massive online courses on Apache Spark | ["Ameet Talwalkar","Anthony Joseph"] | Anthony Joseph | https://databricks.com/blog/2014/12/02/announcing-two-spark-based-moocs.html |
| Distributing the Singular Value Decomposition with Apache Spark | ["Li Pu","Reza Zadeh"] | Reza Zadeh | https://databricks.com/blog/2014/07/21/distributing-the-singular-value-decomposition-with-spark.html |
| Distributing the Singular Value Decomposition with Apache Spark | ["Li Pu","Reza Zadeh"] | Li Pu | https://databricks.com/blog/2014/07/21/distributing-the-singular-value-decomposition-with-spark.html |
| Exciting Performance Improvements on the Horizon for Spark SQL | ["Michael Armbrust","Zongheng Yang"] | Michael Armbrust | https://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html |
| Exciting Performance Improvements on the Horizon for Spark SQL | ["Michael Armbrust","Zongheng Yang"] | Zongheng Yang | https://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html |
| ML Pipelines: A New High-Level API for MLlib | ["Xiangrui Meng","Joseph Bradley","Evan Sparks (UC Berkeley)","Shivaram Venkataraman (UC Berkeley)"] | Shivaram Venkataraman (UC Berkeley) | https://databricks.com/blog/2015/01/07/ml-pipelines-a-new-high-level-api-for-mllib.html |
| ML Pipelines: A New High-Level API for MLlib | ["Xiangrui Meng","Joseph Bradley","Evan Sparks (UC Berkeley)","Shivaram Venkataraman (UC Berkeley)"] | Evan Sparks (UC Berkeley) | https://databricks.com/blog/2015/01/07/ml-pipelines-a-new-high-level-api-for-mllib.html |
| ML Pipelines: A New High-Level API for MLlib | ["Xiangrui Meng","Joseph Bradley","Evan Sparks (UC Berkeley)","Shivaram Venkataraman (UC Berkeley)"] | Xiangrui Meng | https://databricks.com/blog/2015/01/07/ml-pipelines-a-new-high-level-api-for-mllib.html |
| ML Pipelines: A New High-Level API for MLlib | ["Xiangrui Meng","Joseph Bradley","Evan Sparks (UC Berkeley)","Shivaram Venkataraman (UC Berkeley)"] | Joseph Bradley | https://databricks.com/blog/2015/01/07/ml-pipelines-a-new-high-level-api-for-mllib.html |
| Making Apache Spark Easier to Use in Java with Java 8 | ["Prashant Sharma","Matei Zaharia"] | Prashant Sharma | https://databricks.com/blog/2014/04/14/spark-with-java-8.html |
| Making Apache Spark Easier to Use in Java with Java 8 | ["Prashant Sharma","Matei Zaharia"] | Matei Zaharia | https://databricks.com/blog/2014/04/14/spark-with-java-8.html |
| Mining Ecommerce Graph Data with Apache Spark at Alibaba Taobao | ["Andy Huang (Alibaba Taobao Data Mining Team)","Wei Wu (Alibaba Taobao Data Mining Team)"] | Andy Huang (Alibaba Taobao Data Mining Team) | https://databricks.com/blog/2014/08/14/mining-graph-data-with-spark-at-alibaba-taobao.html |
| Mining Ecommerce Graph Data with Apache Spark at Alibaba Taobao | ["Andy Huang (Alibaba Taobao Data Mining Team)","Wei Wu (Alibaba Taobao Data Mining Team)"] | Wei Wu (Alibaba Taobao Data Mining Team) | https://databricks.com/blog/2014/08/14/mining-graph-data-with-spark-at-alibaba-taobao.html |
| Random Forests and Boosting in MLlib | ["Joseph K. Bradley (Databricks)","Manish Amde (Origami Logic)"] | Manish Amde (Origami Logic) | https://databricks.com/blog/2015/01/21/random-forests-and-boosting-in-mllib.html |
| Random Forests and Boosting in MLlib | ["Joseph K. Bradley (Databricks)","Manish Amde (Origami Logic)"] | Joseph K. Bradley (Databricks) | https://databricks.com/blog/2015/01/21/random-forests-and-boosting-in-mllib.html |
| Samsung SDS uses Apache Spark for prescriptive analytics at large scale | ["Luis Quintela (Sr. Manager of Big Data Analytics)","Yan Breek (Data Scientist)","Girish Kathalagiri (Data Analytics Engineer)"] | Luis Quintela (Sr. Manager of Big Data Analytics) | https://databricks.com/blog/2014/11/21/samsung-sds-uses-spark-for-prescriptive-analytics-at-large-scale.html |
| Samsung SDS uses Apache Spark for prescriptive analytics at large scale | ["Luis Quintela (Sr. Manager of Big Data Analytics)","Yan Breek (Data Scientist)","Girish Kathalagiri (Data Analytics Engineer)"] | Yan Breek (Data Scientist) | https://databricks.com/blog/2014/11/21/samsung-sds-uses-spark-for-prescriptive-analytics-at-large-scale.html |
| Samsung SDS uses Apache Spark for prescriptive analytics at large scale | ["Luis Quintela (Sr. Manager of Big Data Analytics)","Yan Breek (Data Scientist)","Girish Kathalagiri (Data Analytics Engineer)"] | Girish Kathalagiri (Data Analytics Engineer) | https://databricks.com/blog/2014/11/21/samsung-sds-uses-spark-for-prescriptive-analytics-at-large-scale.html |
| Scalable Collaborative Filtering with Apache Spark MLlib | ["Burak Yavuz","Xiangrui Meng","Reynold Xin"] | Xiangrui Meng | https://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html |
| Scalable Collaborative Filtering with Apache Spark MLlib | ["Burak Yavuz","Xiangrui Meng","Reynold Xin"] | Reynold Xin | https://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html |
| Scalable Collaborative Filtering with Apache Spark MLlib | ["Burak Yavuz","Xiangrui Meng","Reynold Xin"] | Burak Yavuz | https://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html |
| Scalable Decision Trees in MLlib | ["Manish Amde (Origami Logic)","Joseph Bradley (Databricks)"] | Joseph Bradley (Databricks) | https://databricks.com/blog/2014/09/29/scalable-decision-trees-in-mllib.html |
| Scalable Decision Trees in MLlib | ["Manish Amde (Origami Logic)","Joseph Bradley (Databricks)"] | Manish Amde (Origami Logic) | https://databricks.com/blog/2014/09/29/scalable-decision-trees-in-mllib.html |
| Sharethrough Uses Apache Spark Streaming to Optimize Bidding in Real Time | ["Russell Cardullo (Data Infrastructure Engineer at Sharethrough)","Michael Ruggiero (Data Infrastructure Engineer at Sharethrough)"] | Russell Cardullo (Data Infrastructure Engineer at Sharethrough) | https://databricks.com/blog/2014/03/25/sharethrough-and-spark-streaming.html |
| Sharethrough Uses Apache Spark Streaming to Optimize Bidding in Real Time | ["Russell Cardullo (Data Infrastructure Engineer at Sharethrough)","Michael Ruggiero (Data Infrastructure Engineer at Sharethrough)"] | Michael Ruggiero (Data Infrastructure Engineer at Sharethrough) | https://databricks.com/blog/2014/03/25/sharethrough-and-spark-streaming.html |
| Spark SQL: Manipulating Structured Data Using Apache Spark | ["Michael Armbrust","Reynold Xin"] | Michael Armbrust | https://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html |
| Spark SQL: Manipulating Structured Data Using Apache Spark | ["Michael Armbrust","Reynold Xin"] | Reynold Xin | https://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html |
| Statistics Functionality in Apache Spark 1.1 | ["Doris Xin","Burak Yavuz","Xiangrui Meng","Hossein Falaki"] | Doris Xin | https://databricks.com/blog/2014/08/27/statistics-functionality-in-spark.html |
| Statistics Functionality in Apache Spark 1.1 | ["Doris Xin","Burak Yavuz","Xiangrui Meng","Hossein Falaki"] | Burak Yavuz | https://databricks.com/blog/2014/08/27/statistics-functionality-in-spark.html |
| Statistics Functionality in Apache Spark 1.1 | ["Doris Xin","Burak Yavuz","Xiangrui Meng","Hossein Falaki"] | Xiangrui Meng | https://databricks.com/blog/2014/08/27/statistics-functionality-in-spark.html |
| Statistics Functionality in Apache Spark 1.1 | ["Doris Xin","Burak Yavuz","Xiangrui Meng","Hossein Falaki"] | Hossein Falaki | https://databricks.com/blog/2014/08/27/statistics-functionality-in-spark.html |
| The Apache Spark Certified Developer Program | ["John Tripier","Paco Nathan"] | John Tripier | https://databricks.com/blog/2014/11/14/the-spark-certified-developer-program.html |
| The Apache Spark Certified Developer Program | ["John Tripier","Paco Nathan"] | Paco Nathan | https://databricks.com/blog/2014/11/14/the-spark-certified-developer-program.html |
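The `explode` and `size` calls used above can be illustrated with a plain-Python analogy. The rows below (`posts`, with made-up titles and authors) are hypothetical and stand in for `databricksBlogDF`; the real lesson uses the DataFrame API, not list comprehensions.

```python
# Hypothetical rows standing in for databricksBlogDF.
posts = [
    {"title": "Post A", "authors": ["Alice", "Bob"]},
    {"title": "Post B", "authors": ["Carol"]},
]

# explode(col("authors")).alias("author") ~ one output row per array element,
# duplicating the other columns.
exploded = [
    {"title": p["title"], "author": a}
    for p in posts
    for a in p["authors"]
]

# filter(size(col("authors")) > 1) ~ keep only posts whose authors array
# has more than one element, then explode those.
multi_author = [
    {"title": p["title"], "author": a}
    for p in posts
    if len(p["authors"]) > 1
    for a in p["authors"]
]
```

This is why multi-author posts such as "Apache Spark: A Delight for Developers" appear once per author in the output above.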
%md-sandbox
### Step 1

Starting with the `databricksBlogDF` DataFrame, create a DataFrame called `articlesByMichaelDF` where:
0. Michael Armbrust is the author.
0. The data set contains the column `title` (it may contain others).
0. It contains only one record per article.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/> **Hint:** See the Spark documentation on <a href="https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$" target="_blank">built-in functions</a>.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/> **Hint:** Include the column `authors` in your view to help you debug your solution.
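The required logic can be sketched in plain Python before you translate it into DataFrame operations. The rows below are hypothetical; the set comprehension plays the role of filtering on array membership (as `array_contains` does), projecting one column, and dropping duplicates.

```python
# Hypothetical rows standing in for databricksBlogDF; the duplicate row
# simulates an article appearing more than once.
posts = [
    {"title": "Spark SQL intro", "authors": ["Michael Armbrust", "Reynold Xin"]},
    {"title": "Spark SQL intro", "authors": ["Michael Armbrust", "Reynold Xin"]},
    {"title": "Streaming update", "authors": ["Tathagata Das"]},
]

# Membership test ~ array_contains; projecting "title" ~ select;
# building a set ~ deduplication.
articles_by_michael = sorted({
    p["title"]
    for p in posts
    if "Michael Armbrust" in p["authors"]
})
```

Your DataFrame solution should express the same three ideas: filter on the `authors` array, select `title`, and deduplicate.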
# TEST - Run this cell to test your solution.
from pyspark.sql import Row

resultsCount = articlesByMichaelDF.count()
dbTest("DF-L5-articlesByMichael-count", 3, resultsCount)

results = articlesByMichaelDF.collect()

dbTest("DF-L5-articlesByMichael-0", Row(title=u'Spark SQL: Manipulating Structured Data Using Apache Spark'), results[0])
dbTest("DF-L5-articlesByMichael-1", Row(title=u'Exciting Performance Improvements on the Horizon for Spark SQL'), results[1])
dbTest("DF-L5-articlesByMichael-2", Row(title=u'Spark SQL Data Sources API: Unified Data Access for the Apache Spark Platform'), results[2])

print("Tests passed!")
%md ### Step 1 Starting with the `databricksBlogDF` DataFrame, create another DataFrame called `uniqueCategoriesDF` where: 0. The data set contains the one column `category` (and no others). 0. This list of categories should be unique.
Step 1
Starting with the databricksBlogDF DataFrame, create another DataFrame called uniqueCategoriesDF where:
- The data set contains the one column category (and no others).
- This list of categories should be unique.
# TEST - Run this cell to test your solution.
from pyspark.sql import Row

resultsCount = uniqueCategoriesDF.count()
dbTest("DF-L5-uniqueCategories-count", 12, resultsCount)

results = uniqueCategoriesDF.collect()

dbTest("DF-L5-uniqueCategories-0", Row(category=u'Announcements'), results[0])
dbTest("DF-L5-uniqueCategories-1", Row(category=u'Apache Spark'), results[1])
dbTest("DF-L5-uniqueCategories-2", Row(category=u'Company Blog'), results[2])
dbTest("DF-L5-uniqueCategories-9", Row(category=u'Platform'), results[9])
dbTest("DF-L5-uniqueCategories-10", Row(category=u'Product'), results[10])
dbTest("DF-L5-uniqueCategories-11", Row(category=u'Streaming'), results[11])

print("Tests passed!")
%md-sandbox ### Step 1 Starting with the `databricksBlogDF` DataFrame, create another DataFrame called `totalArticlesByCategoryDF` where: 0. The new DataFrame contains two columns, `category` and `total`. 0. The `category` column is a single, distinct category (similar to the last exercise). 0. The `total` column is the total number of articles in that category. 0. Order by `category`. <img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Because articles can be tagged with multiple categories, the sum of the totals adds up to more than the total number of articles.
Querying JSON & Hierarchical Data with DataFrames
Apache Spark™ and Azure Databricks® make it easy to work with hierarchical data, such as nested JSON records.